## 0/ WHAT IS JUPYTER

- It is a platform for Python developers 
    - It can run Python code
    - It can also hold notes

- It executes Python code cell by cell and shows the result within the file

In [None]:
# total is a variable (named total rather than sum to avoid shadowing Python's built-in sum())
total = 2 + 2

# print total
print(total)


***

## 1/ DEFINE PROBLEM

### Project Summary
- You work as a data scientist at a used car buying and selling company
- Your goal is to predict the prices of cars based on historical car data

### Given Car Dataset

- Used car data from 1985


### Goal: 

- Predict price of car based on engine size
    - target variable (what to predict): price
    - predictor variable (what is given to us): engine-size


***

## 2/ DATA ACQUISITION

#### What is Pandas?

- Pandas is a Python library for working with DataFrames (labelled, tabular data)
- It is used for data manipulation and analysis
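
A minimal sketch of how `read_csv` treats a file that has no header row, which is the situation with this dataset. The two-line CSV text and the three column names are made up for illustration, and `io.StringIO` stands in for the remote file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line (illustrative values only)
raw = "audi,gas,109\nbmw,gas,164\n"

# Default behaviour: the first line is consumed as column names
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies the labels
with_names = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["make", "fuel-type", "engine-size"],
)

print(with_default.shape)  # the first record was used up as the header
print(with_names.shape)    # both records are kept as data
```

With the default settings the first record silently becomes the column labels, so it is worth checking `shape` after loading.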

In [None]:
## import library called pandas 
import pandas as pd 

## import dataset using pandas from an online repository
## the file has no header row, so header=None keeps the first record as data
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=None)
## data = Pandas DataFrame which has many built-in functions

## add column labels 
header = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','length','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
data.columns = header

## check data and inspect missing values
data.head()





In [None]:
data.shape

***

## 3/ DATA PREPARATION

#### a/ missing value clean-up

- missing values in this dataset are denoted by '?'
    - pandas treats '?' as an ordinary string, so `dropna()` cannot remove these rows directly
- replace them with NumPy NaN values and drop the corresponding observations
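
The two steps above can be sketched on a tiny, made-up DataFrame. The values below are illustrative only and `toy` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the car data; '?' marks a missing price
toy = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "price": ["13495", "?", "13950", "17450"],
})

# dropna() removes nothing yet: '?' is an ordinary string, not a missing value
rows_before = len(toy.dropna())

# Replace the placeholder with NaN, then drop the affected rows
cleaned = toy.replace("?", np.nan).dropna()
rows_after = len(cleaned)

print(rows_before, rows_after)
print(cleaned)
```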

#### What is NumPy?

- it is a Python library for working with numerical arrays and matrices

In [None]:
## import library called numpy
import numpy as np

## replace '?' with numpy NaN (not a number)
data.replace('?', np.nan, inplace=True)

## drop the rows with NaN
data.dropna(inplace=True)

## check cleaned up data for engine-size
data[['engine-size','price']]




#### b/ data-type clean-up
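
A column that contained '?' placeholders is read in as text, so it is still not numeric after the placeholders are removed. A minimal sketch of the conversion, using made-up values and the hypothetical names `prices` and `prices_int`:

```python
import pandas as pd

# Hypothetical price column that was read in as text
prices = pd.Series(["13495", "16500", "13950"], name="price")
print(prices.dtype)  # a text dtype, so numeric operations would not work yet

# Convert the strings to integers
prices_int = prices.astype("int")
print(prices_int.dtype)
print(prices_int.sum())
```

If a column may still hold non-numeric strings, `astype` raises an error, which is one reason to clean missing values first.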
 


In [None]:
print(data.dtypes)
 


In [None]:
## change the data type of price from object to integer
data['price'] = data['price'].astype('int')

***

## 4/ EXPLORATORY DATA ANALYSIS

- Engine-Size is a continuous variable 
- check out a short statistical summary
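
As a small illustration of what the summary contains, here is `describe()` on a made-up Series. The numbers are illustrative only.

```python
import pandas as pd

# Hypothetical engine sizes (illustrative values only)
sizes = pd.Series([97, 109, 130, 152, 181], name="engine-size")

# describe() returns count, mean, std, min, quartiles and max
summary = sizes.describe()
print(summary)

# Individual statistics can be read by label
mean_size = summary["mean"]
median_size = summary["50%"]
print(mean_size, median_size)
```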



In [None]:
data['engine-size'].describe()
# describe() also accepts optional parameters, e.g. percentiles, include and exclude

- check correlation between engine size and price
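
Besides a plot, the strength of a linear relationship can be summarized as a single number. A minimal sketch using the Pearson correlation from `Series.corr`, on made-up values that are not taken from the dataset:

```python
import pandas as pd

# Hypothetical engine sizes and prices (illustrative values only)
toy = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181],
    "price": [7000, 9000, 13000, 17000, 22000],
})

# Pearson correlation coefficient, between -1 and 1
r = toy["engine-size"].corr(toy["price"])
print(round(r, 3))
```

A value close to 1 suggests that a straight line is a reasonable first model.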

#### What is Seaborn?

- Seaborn is a data visualization Python library

In [None]:
import seaborn as sns

sns.regplot(x="engine-size", y="price", data=data)

***

## 5/ DATA MODELLING

- set up an untrained simple linear regression model to predict price from engine-size
- we fit a line with equation: $$y = mx + c$$
    - we are trying to get 'm' and 'c' to predict 'y' (price) from 'x' (engine-size)
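
For a single predictor, the least-squares values of 'm' and 'c' have a closed form: m is the covariance of x and y divided by the variance of x, and c is mean(y) minus m times mean(x). A minimal sketch with NumPy on made-up data, cross-checked against `np.polyfit`:

```python
import numpy as np

# Hypothetical engine sizes (x) and prices (y), illustrative values only
x = np.array([97, 109, 130, 152, 181], dtype=float)
y = np.array([7000, 9000, 13000, 17000, 22000], dtype=float)

# Closed-form least-squares estimates for y = m*x + c
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, intercept)
m_ref, c_ref = np.polyfit(x, y, 1)

print(m, c)
print(m_ref, c_ref)
```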

#### What is scikit-learn (sklearn)?

- it is a Python library for machine learning



In [None]:
# for linear regression:
from sklearn.linear_model import LinearRegression

# initialize a linear model
lm_engine_size = LinearRegression()


- training the linear model

In [None]:

# fit the linear model
lm_engine_size.fit(data[['engine-size']], data['price'])

- inspect the linear model created 
    - linear model: y = mx + c

In [None]:
print('c = ',lm_engine_size.intercept_)
print('m = ',lm_engine_size.coef_)

***

## 6/ VISUALIZATION OF THE MODEL

#### What is matplotlib?

- this is a Python library for data visualization 

In [None]:
from matplotlib import pyplot as plt

# initialize plot
plt.figure(0)

# plot the scatter plot for engine-size and price
plt.scatter(data['engine-size'], data['price'])

# add labels
plt.xlabel('engine-size')
plt.ylabel('price')

# find the bounds of x values
x_bounds = plt.xlim()
# y_bounds = plt.ylim()
# print(x_bounds, y_bounds)

# draw the best-fit line
x_vals = np.linspace(x_bounds[0],x_bounds[1],num=50)
y_vals = lm_engine_size.intercept_ + lm_engine_size.coef_ * x_vals
# print(x_vals, y_vals)

plt.plot(x_vals, y_vals, '--')

plt.title('Engine-Size based Linear Price Estimator')

***

## Use Case of this System 

- if the car engine-size is 150, the model predicts a price of about 16442.71
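
A prediction is just the fitted line evaluated at a given x. As a minimal sketch on made-up data (so the numbers differ from the notebook's), the model's `predict` method gives the same result as the manual formula. The names `toy`, `model` and `query` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical engine sizes and prices (illustrative values only)
toy = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181],
    "price": [7000, 9000, 13000, 17000, 22000],
})

model = LinearRegression().fit(toy[["engine-size"]], toy["price"])

# predict() expects 2-D input; reuse the column name that was used for fitting
query = pd.DataFrame({"engine-size": [150]})
predicted = model.predict(query)[0]

# Manual evaluation of y = m*x + c for comparison
manual = model.coef_[0] * 150 + model.intercept_

print(predicted, manual)
```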

In [None]:

## y (price) = m (linear coefficient) * x (engine-size) + c (linear intercept)
price = lm_engine_size.coef_[0] * 150 + lm_engine_size.intercept_
print(price)