In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# How Much is Your Car Worth?

Data about the retail price of 2005 General Motors cars can be found in `car_data.csv`.

The columns are:

1. Price: suggested retail price of the used 2005 GM car in excellent condition.
2. Mileage: number of miles the car has been driven
3. Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet
4. Model: specific models for each car manufacturer such as Ion, Vibe, Cavalier
5. Trim (of car): specific type of car model such as SE Sedan 4D, Quad Coupe 2D
6. Type: body type such as sedan, coupe, etc.
7. Cylinder: number of cylinders in the engine
8. Liter: a more specific measure of engine size
9. Doors: number of doors
10. Cruise: indicator variable representing whether the car has cruise control (1 = cruise)
11. Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)
12. Leather: indicator variable representing whether the car has leather seats (1 = leather)

## Tasks, Part 1

1. Find the linear regression equation for mileage vs price.
2. Chart the original data and the equation on the chart.
3. Find the equation's $R^2$ score (use the `.score` method) to determine whether the
equation is a good fit for this data. (0.8 and greater is considered a strong correlation.)

## Tasks, Part 2

1. Use mileage, cylinders, liters, doors, cruise, sound, and leather to find the linear regression equation.
2. Find the equation's $R^2$ score (use the `.score` method) to determine whether the
equation is a good fit for this data. (0.8 and greater is considered a strong correlation.)
3. Find the combination of the factors that is the best predictor for price.

## Tasks, Hard Mode

1. Research dummy variables in scikit-learn to see how to use the make, model, and body type.
2. Find the best combination of factors to predict price.
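
One possible way to approach the dummy-variable step is sketched below. It uses `pd.get_dummies` from pandas rather than a scikit-learn encoder, which is a substitution made here for brevity. The helper name `fit_with_dummies` and its parameters are hypothetical and not part of the assignment; the column names are assumed to match the list above.

```python
import pandas as pd
from sklearn import linear_model


def fit_with_dummies(data, numeric_cols, categorical_cols, target="Price"):
    """Fit a linear regression after one-hot encoding categorical columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Keep only the columns we need and drop incomplete rows.
    subset = data[numeric_cols + categorical_cols + [target]].dropna()

    # drop_first=True removes one level per category so the dummy columns
    # are not perfectly collinear with the intercept.
    features = pd.get_dummies(
        subset[numeric_cols + categorical_cols],
        columns=categorical_cols,
        drop_first=True,
        dtype=float,
    )

    model = linear_model.LinearRegression()
    model.fit(features, subset[target])
    score = model.score(features, subset[target])
    return model, features.columns.tolist(), score
```

With the car data loaded, a call might look like `fit_with_dummies(car_data, ['Mileage', 'Cylinder'], ['Make', 'Type'])`. Each category level becomes its own 0/1 column, so the model gets one coefficient per make or body type.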

In [None]:
car_data = pd.read_csv("car_data.csv")

In [None]:
car_data


In [None]:
indicators = car_data[['Price', 'Mileage']]

In [None]:
indicators[['Price','Mileage']].dropna(thresh=2).head()

In [None]:
plt.scatter(indicators['Mileage'],
            indicators['Price'])
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

In [None]:
df = indicators.loc[:, ['Mileage', 'Price']]

df.dropna(inplace=True)
mileage_data = df[['Mileage']]
price_data = df[['Price']]


In [None]:
regr1 = linear_model.LinearRegression()
regr1.fit(mileage_data, price_data)
print('Coefficients: ', regr1.coef_)
print(regr1.score(mileage_data, price_data))

A negative correlation coefficient means that, for any two variables X and Y, an increase in X is associated with a decrease in Y. A negative correlation demonstrates a connection between two variables in the same way a positive correlation does: the strength of the relationship is given by the magnitude of the coefficient, not its sign. (Lance explained this to me, so he gets the credit.)
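
To make the point about sign and strength concrete, here is a small sketch. The numbers are made up for illustration and are not taken from `car_data.csv`. It computes the Pearson correlation coefficient with `np.corrcoef` and compares it with the slope and the `.score` value of a one-variable regression. For a regression with a single input, $R^2$ equals the square of the correlation coefficient, so the sign is lost but the strength is kept.

```python
import numpy as np
from sklearn import linear_model

# Made-up illustrative numbers (not from car_data.csv).
mileage = np.array([5000.0, 12000.0, 20000.0, 28000.0, 35000.0, 41000.0])
price = np.array([24000.0, 22500.0, 23000.0, 20500.0, 21000.0, 19000.0])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(mileage, price)[0, 1]

# Simple linear regression of price on mileage.
model = linear_model.LinearRegression()
model.fit(mileage.reshape(-1, 1), price)
slope = model.coef_[0]
r_squared = model.score(mileage.reshape(-1, 1), price)

print("r:", r)            # negative: price falls as mileage rises
print("slope:", slope)    # has the same sign as r
print("R^2:", r_squared, "r squared:", r ** 2)
```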

In [None]:
plt.scatter(mileage_data, price_data, color='red')
plt.plot(mileage_data, regr1.predict(mileage_data), color='blue', linewidth=3)
plt.show()

Based on the $R^2$ score, a linear regression on mileage alone is not a good fit for this data. There is barely any correlation between mileage and price.
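
For reference, the value returned by `.score` for a linear regression is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below recomputes it by hand on made-up data to show what the number measures. The helper `r_squared_by_hand` and the demo arrays are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn import linear_model


def r_squared_by_hand(model, X, y):
    """Recompute R^2 as 1 - SS_res / SS_tot for a fitted model."""
    y = np.asarray(y, dtype=float).ravel()
    predictions = np.asarray(model.predict(X), dtype=float).ravel()
    ss_res = np.sum((y - predictions) ** 2)   # variation the line misses
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up demo data (not from car_data.csv).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
demo_model = linear_model.LinearRegression().fit(X_demo, y_demo)

print(r_squared_by_hand(demo_model, X_demo, y_demo))
print(demo_model.score(X_demo, y_demo))  # should agree with the line above
```

A score near 0 means the line does little better than predicting the mean price for every car.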

In [None]:
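# predict expects a 2D input with the same column the model was fitted on
regr1.predict(pd.DataFrame({'Mileage': [10000]}))[0]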

### Task 2

In [None]:

new_data = car_data.loc[:, ['Mileage', 
                        'Cylinder',
                        'Liter',
                        'Doors', 
                        'Cruise', 
                        'Sound',
                        'Leather',
                        'Price']]

new_data.dropna(inplace=True)
# Build inputs and output from the cleaned frame so the rows stay aligned.
input_data = new_data[['Mileage',
                 'Cylinder',
                 'Liter',
                 'Doors',
                 'Cruise',
                 'Sound',
                 'Leather']]
output_data = new_data['Price']


In [172]:
regr2 = linear_model.LinearRegression()
regr2.fit(input_data, output_data)
print('Coefficients: ', regr2.coef_)
print(regr2.score(input_data, output_data))

Coefficients:  [ -1.69747832e-01   3.79237893e+03  -7.87220732e+02  -1.54274585e+03
   6.28899715e+03  -1.99379528e+03   3.34936162e+03]
0.446264353673


This equation is not a good fit for this data. Taken together, Mileage, Cylinder, Liter, Doors, Cruise, Sound, and Leather explain only about 45% of the variation in Price ($R^2 \approx 0.45$), well below the 0.8 threshold.


Now I am going to change the features to try to find the best predictor for price.

In [178]:

new_data1 = car_data.loc[:,[
                        'Leather',
                        'Cruise',
                        'Doors',
                        'Cylinder',
                        'Price'
                        ]]

new_data1.dropna(inplace=True)
# Build inputs and output from the cleaned frame so the rows stay aligned.
input_data1 = new_data1[[
                 'Cruise',
                 'Cylinder',
                 'Leather',
                 'Doors',
                 ]]
output_data1 = new_data1['Price']


regr3 = linear_model.LinearRegression()
regr3.fit(input_data1, output_data1)
print('Coefficients: ', regr3.coef_)
print(regr3.score(input_data1, output_data1))

Coefficients:  [ 6191.20042916  3301.34732506  2959.5780559  -1378.68918044]
0.417760273702


I had trouble finding the best predictor for price. What I do know is that when the model does not include cylinder or liter, the $R^2$ score decreases drastically.
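
Rather than swapping features by hand, one option is to score every combination and sort the results. The sketch below does that with `itertools.combinations`. The helper name `rank_feature_subsets` is hypothetical and not part of the original notebook. One caveat: `.score` here is measured on the same data the model was fitted on, and that $R^2$ cannot go down when a feature is added, so the full feature set will always sit at or near the top of the ranking. The more useful signal is which small subsets come close to it.

```python
from itertools import combinations

import pandas as pd
from sklearn import linear_model


def rank_feature_subsets(data, features, target="Price"):
    """Fit a regression on every non-empty feature subset, best score first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Drop incomplete rows once so every subset is fitted on the same cars.
    clean = data[list(features) + [target]].dropna()

    results = []
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            cols = list(subset)
            model = linear_model.LinearRegression()
            model.fit(clean[cols], clean[target])
            results.append((model.score(clean[cols], clean[target]), subset))

    # Highest R^2 first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results
```

With the seven numeric columns from Task 2 this fits 127 models, which is quick. A call such as `rank_feature_subsets(car_data, ['Mileage', 'Cylinder', 'Liter', 'Doors', 'Cruise', 'Sound', 'Leather'])[:5]` would list the five highest-scoring combinations.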