# Part 2: Car Factors 

You are to construct a predictive model that estimates how long a given car will take to sell (its listing duration).

1. Populate the carsfactors.py following the hints in the comments
1. Integrate with carfactors_service.py
1. Test locally
1. Build requirements.txt and Dockerfile
1. Build a docker image
1. Test the image locally
1. Push to docker hub
1. Populate readme for both github and docker hub (with example docker commands)
1. Populate this notebook with working output and a summary that contains an impression of the model and how to improve it.

* ***Review the [codeSamplesforCategoricalData.ipynb](./codeSamplesforCategoricalData.ipynb) for code review of the categorical data manipulations***.

In [1]:
from carsfactors import carsfactors
cf = carsfactors()

### Test Model first - Get stats

In [2]:
cf.model_stats()

'0.0021029748868419684'

### Get Determination

In [3]:

cf.model_infer(["automatic", "mechanical", "automatic"],["silver", "grey", "white"],["sedan", "hatchback", "minivan"])

<carsfactors.carsfactors object at 0x14f7a29d0> ['sedan', 'hatchback', 'minivan']




'[[1.04107871e+13]\n [1.08934287e+13]\n [2.93971392e+12]]'

### Start up the service

In [4]:
!python carfactors_service.py

starting server...
 * Serving Flask app 'carfactors_service'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8080
 * Running on http://172.20.10.3:8080
Press CTRL+C to quit
127.0.0.1 - - [18/Feb/2024 20:53:31] "GET /stats HTTP/1.1" 200 -
REQ:  <Request 'http://127.0.0.1:8080/infer?transmission=automatic&color=blue&bodytype=suv' [GET]> ARGS:  ImmutableMultiDict([('transmission', 'automatic'), ('color', 'blue'), ('bodytype', 'suv')])
<carsfactors.carsfactors object at 0x10dd0bf10> suv
127.0.0.1 - - [18/Feb/2024 20:53:36] "GET /infer?transmission=automatic&color=blue&bodytype=suv HTTP/1.1" 200 -
^C


Try out the links 
* [stats](http://127.0.0.1:8080/stats)
* [determination](http://127.0.0.1:8080/infer?transmission=automatic&color=blue&bodytype=suv)
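
The same two routes can also be called from Python rather than the browser. The sketch below uses only the standard library. The helper names `build_infer_url` and `fetch_text` are hypothetical (they are not part of `carfactors_service.py`), and the response is read as plain text because the exact response format is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8080"  # address printed by the Flask server above


def build_infer_url(transmission, color, bodytype, base_url=BASE_URL):
    """Build the /infer URL using the query parameters seen in the server log."""
    query = urlencode(
        {"transmission": transmission, "color": color, "bodytype": bodytype}
    )
    return f"{base_url}/infer?{query}"


def fetch_text(url, timeout=5):
    """Send a GET request and return the response body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_infer_url("automatic", "blue", "suv")
print(url)

# With the service running in another process, uncomment to query it:
# print(fetch_text(f"{BASE_URL}/stats"))
# print(fetch_text(url))
```

The network calls are left commented out so the cell runs even when the service is not up.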

### You must kill the kernel before trying again, because the port stays locked by the current kernel

# Summary
* Assignment and Model Results
* Techniques to improve the results

**Assignment and Model Results**

This assignment centers on running a Flask app that serves a model predicting how long a car stays listed (`duration_listed`). The given dataset is a CSV file of car listings:

In [6]:
import pandas as pd
df = pd.read_csv("cars.csv")
print(df.columns)
df.head()

Index(['manufacturer_name', 'model_name', 'transmission', 'color',
       'odometer_value', 'year_produced', 'engine_fuel', 'engine_has_gas',
       'engine_type', 'engine_capacity', 'body_type', 'has_warranty', 'state',
       'drivetrain', 'price_usd', 'is_exchangeable', 'location_region',
       'number_of_photos', 'up_counter', 'feature_0', 'feature_1', 'feature_2',
       'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7',
       'feature_8', 'feature_9', 'duration_listed'],
      dtype='object')


| | manufacturer_name | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_type | engine_capacity | ... | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | duration_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Subaru | Outback | automatic | silver | 190000 | 2010 | gasoline | False | gasoline | 2.5 | ... | True | True | True | False | True | False | True | True | True | 16 |
| 1 | Subaru | Outback | automatic | blue | 290000 | 2002 | gasoline | False | gasoline | 3.0 | ... | True | False | False | True | True | False | False | False | True | 83 |
| 2 | Subaru | Forester | automatic | red | 402000 | 2001 | gasoline | False | gasoline | 2.5 | ... | True | False | False | False | False | False | False | True | True | 151 |
| 3 | Subaru | Impreza | mechanical | blue | 10000 | 1999 | gasoline | False | gasoline | 3.0 | ... | False | False | False | False | False | False | False | False | False | 86 |
| 4 | Subaru | Legacy | automatic | black | 280000 | 2001 | gasoline | False | gasoline | 2.5 | ... | True | False | True | True | False | False | False | False | True | 7 |


The columns in the data (candidate features, plus the target 'duration_listed') are:
*'manufacturer_name', 'model_name', 'transmission', 'color', 'odometer_value', 'year_produced', 'engine_fuel', 'engine_has_gas', 'engine_type', 'engine_capacity', 'body_type', 'has_warranty', 'state', 'drivetrain', 'price_usd', 'is_exchangeable', 'location_region', 'number_of_photos', 'up_counter', 'feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'duration_listed'.*

I selected the features 'transmission', 'color', and 'body_type' to build a linear regression model that predicts the dependent variable 'duration_listed'. The first step after creating the DataFrame was to encode the categorical columns into numerical format, using the provided one-hot and ordinal encoder examples. Encoding creates new numeric features that represent the categories of transmission, color, and body_type, which makes it possible to fit a linear regression model. I then added these features to a new DataFrame (df_processed) and split it into training and testing sets. These splits were used to train and evaluate a linear regression model, which the Flask app's "infer" method uses to make predictions.
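
The steps above can be sketched roughly as follows. This is a minimal illustration on a small made-up DataFrame, not the code in `carsfactors.py`: the toy rows, the use of scikit-learn's `OneHotEncoder` for all three columns, and the 80/20 split are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for cars.csv (values are made up for illustration only).
df_toy = pd.DataFrame({
    "transmission": ["automatic", "mechanical", "automatic", "mechanical",
                     "automatic", "mechanical", "automatic", "mechanical"],
    "color": ["silver", "blue", "red", "blue",
              "black", "white", "grey", "silver"],
    "body_type": ["sedan", "suv", "hatchback", "sedan",
                  "minivan", "suv", "hatchback", "sedan"],
    "duration_listed": [16, 83, 151, 86, 7, 40, 25, 60],
})

feature_cols = ["transmission", "color", "body_type"]

# One-hot encode the categorical columns. Categories not seen during fit
# are encoded as all zeros at inference time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(df_toy[feature_cols]).toarray()
y = df_toy["duration_listed"].to_numpy()

# Hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Encode a new car with the same fitted encoder before predicting.
new_cars = pd.DataFrame([["automatic", "blue", "suv"]], columns=feature_cols)
prediction = model.predict(encoder.transform(new_cars).toarray())
print(prediction)
```

Reusing the fitted encoder at inference time keeps the column order of new inputs identical to the one the model was trained on.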

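The Flask app has two routes: one outputs the model stats and the other outputs predictions. The model stats for our example are low: `model_stats()` returns about 0.0021, and the sample predictions above (on the order of 10^13) are far outside the range of the `duration_listed` values shown in the data. Several machine learning and analysis techniques could help improve the model's performance: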

1. Feature analysis. There are many methods for analyzing the features in a dataset, including PCA, correlation metrics, Fisher's ratio, and other statistics-based measures. The goal of feature analysis is to select the features that most affect the dependent variable, or that give the model the most information to predict from. We also need to guard against overfitting and underfitting the model to the training set, so features must be carefully analyzed and selected for training.
2. Feature engineering. Feature engineering is the process of analyzing features and creating new features, or new representations of features, based on information extracted from the existing data. Encoding features, which we did in this assignment, is one example. After feature analysis, we can also create new features that describe the data in greater detail (e.g., if the transmission and body_type of a car are correlated, we can create a new feature that captures this relationship). Feature engineering, with the use of statistics and code, can help process and refine the existing data so that the model trains on better inputs.
3. Model selection. I used a simple linear regression model to learn from the numerical input data. To improve performance, we can run experiments to determine the best type of model to use (SVM, KNN, decision tree/random forest, NN, etc.).
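
As one possible way to run such an experiment, the sketch below compares two model types with 5-fold cross-validation on a few of the dataset's columns. Everything in it is an assumption for illustration: the data is randomly generated rather than read from `cars.csv`, the choice of columns and of the two candidate models is arbitrary, and because the target here is random noise, the scores are expected to be near or below zero. The point is the workflow, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data using column names from cars.csv (values are random).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "transmission": rng.choice(["automatic", "mechanical"], size=n),
    "body_type": rng.choice(["sedan", "suv", "hatchback"], size=n),
    "price_usd": rng.uniform(500, 30000, size=n),
    "odometer_value": rng.integers(0, 400000, size=n),
})
toy["duration_listed"] = rng.integers(1, 200, size=n)  # random target

categorical = ["transmission", "body_type"]
numeric = ["price_usd", "odometer_value"]

# One-hot encode the categorical columns and pass numeric columns through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    pipeline = make_pipeline(preprocess, estimator)
    scores = cross_val_score(
        pipeline,
        toy[categorical + numeric],
        toy["duration_listed"],
        cv=5,
        scoring="r2",
    )
    results[name] = scores.mean()
    print(f"{name}: mean R^2 = {results[name]:.3f}")
```

Putting the encoder inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by information from the held-out fold.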