# Python for Machine Learning

### *Session \#1*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Preparing Data

### Warm Ups

*Type the given code into the cell below*

---

**Import pandas and read CSV**: 
```python
import pandas as pd
df = pd.read_csv("diamonds.csv")
```

In [1]:
import pandas as pd
df = pd.read_csv("diamonds.csv")

**Isolate a column:** `y = df['carat']`

**Use subset of columns**
```python
columns = ['depth', 'table']
X = df[columns]
```

**Drop columns:** `df.drop(['carat'], axis='columns')`
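
Selecting and dropping are two routes to the same result. The sketch below uses a tiny made-up DataFrame named `toy` (a hypothetical stand-in, not rows from `diamonds.csv`) to show both, and to show that `drop` returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration)
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "depth": [61.5, 59.8, 62.4],
    "table": [55.0, 61.0, 58.0],
})

selected = toy[["depth", "table"]]             # keep only the listed columns
dropped = toy.drop(["carat"], axis="columns")  # remove the listed columns

print(list(selected.columns))
print(list(dropped.columns))
print(list(toy.columns))  # the original DataFrame still has all three columns
```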

**Split data into train/test sets:**
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

*Note: The default split is 75% train, 25% test. Change the proportion with the* `test_size` *parameter.*
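
As a rough illustration of `test_size`, the sketch below splits 20 made-up rows so that 20% are held out. The `random_state` argument is an addition not used elsewhere in this notebook; it fixes the shuffle so the split is the same on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, made up for illustration
X = pd.DataFrame({"depth": range(20), "table": range(20, 40)})
y = pd.Series(range(100, 120), name="price")

# Hold out 20% of the rows; random_state fixes the shuffle so reruns match
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

With 20 rows, this leaves 16 rows for training and 4 for testing, and the rows of `X_train` and `y_train` stay aligned by index.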

In [None]:
from sklearn.model_selection import train_test_split

### Exercises
---

**1. Create the feature matrix** `X` **by forming a dataframe from the columns** `x, y, z,` **and** `carat`

In [5]:
cols = ['x', 'y', 'z', 'carat']
X = df[cols]

**2. Create a numeric feature matrix** `X`, **this time by dropping the non-numeric columns** `cut, color, clarity` **and the target vector** `price`

In [7]:
nonnumeric = ['cut', 'color', 'clarity', 'price']
X = df.drop(nonnumeric, axis='columns')

**3. Create the target vector** `y` **from the column** `price`

In [10]:
y = df['price']

**4. Use** `train_test_split` **to divide your data into** `X_train`, `X_test`, `y_train`, `y_test`

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

## II. Scikit-Learn Basics

### Warm Ups

*Type the given code into the cell below*

---

**Choose model and hyperparameters:**
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)  # `normalize` was removed in scikit-learn 1.2; omit it on newer versions
```

**Fit model to data:** `model.fit(X_train, y_train)`

**Use model to predict:** `y_predicted = model.predict(X_test)`

**Find average error:** 
```python
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_predicted)  # true values first, then predictions
```

In [None]:
from sklearn.metrics import mean_absolute_error

**Find R2 score:** `model.score(X_test, y_test)`
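
To make the two metrics concrete, here is a minimal sketch that computes both by hand on made-up numbers and compares them with scikit-learn's functions. The arrays `y_true` and `y_pred` are hypothetical; for regressors such as `LinearRegression`, `model.score` returns the same R2 quantity that `r2_score` computes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted prices, for illustration only
y_true = np.array([300.0, 500.0, 700.0, 900.0])
y_pred = np.array([350.0, 450.0, 800.0, 900.0])

# Mean absolute error: average size of the prediction errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Here the errors are 50, 50, 100 and 0, so the mean absolute error is 50, and the R2 score works out to 1 - 15000/200000 = 0.925.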

### Exercises
---

**1. Create a LinearRegression model**

In [21]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)  # omit `normalize` on scikit-learn 1.2+

**2. Fit the model using** `X_train` **and** `y_train`

In [22]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

**3. Make a** `y_predicted` **variable using** `model.predict()` **on** `X_test`

In [24]:
y_predicted = model.predict(X_test)

**4. Use** `y_predicted` **and** `y_test` **to find the mean absolute error of your model**

In [25]:
mean_absolute_error(y_test, y_predicted)

889.3546684612477

**5. Use** `X_test` **and** `y_test` **to find the R2 score of your model**

In [None]:
model.score(X_test, y_test)

In [28]:
df['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

## III. One-Hot Encoding and Pipelines

### Warm Ups

*Type the given code into the cell below*

---

**Imports for One-Hot Encoding and Pipelines:**

In [67]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

**Add encoding to model using pipeline:**
```python
model = make_pipeline(OneHotEncoder(), LinearRegression())
```

In [31]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

model = make_pipeline(OneHotEncoder(), LinearRegression())

**Add StandardScaler to model using pipeline:**
```python
model = make_pipeline(StandardScaler(), LinearRegression())
```
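
A small sketch of what `StandardScaler` does to its input, using made-up numbers: each column is shifted and rescaled so that it has mean 0 and standard deviation 1, which puts features measured on very different scales on a comparable footing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
X = np.array([[0.2, 55.0],
              [0.5, 60.0],
              [0.8, 65.0]])

scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```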

**Split pipeline between two types of columns:**
```python
numeric = ['x', 'y', 'z', 'carat', 'table', 'depth']
categorical = ['cut', 'color', 'clarity']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric),
        ('cat', OneHotEncoder(), categorical)
    ])

model = make_pipeline(preprocessor, LinearRegression())
```
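
The same pattern can be tried end to end on a few made-up rows. This is only a sketch under stated assumptions: the DataFrame `toy` and its values are hypothetical, and the model is fit and used on the same rows purely to show the mechanics, not to measure accuracy.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration (not real diamonds data)
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2, 1.5],
    "cut": ["Good", "Ideal", "Good", "Ideal", "Premium", "Premium"],
    "price": [400, 900, 1500, 3000, 4200, 6000],
})
X = toy[["carat", "cut"]]
y = toy["price"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["carat"]),
        ("cat", OneHotEncoder(), ["cut"]),
    ])

# 1 scaled numeric column + 3 one-hot columns (Good, Ideal, Premium)
transformed_shape = preprocessor.fit_transform(X).shape
print(transformed_shape)

# The pipeline applies the preprocessor, then fits the regression
model = make_pipeline(preprocessor, LinearRegression())
model.fit(X, y)
predictions = model.predict(X.head(2))
print(predictions)
```

Because the preprocessing lives inside the pipeline, `model.predict` accepts the raw DataFrame columns directly; there is no need to scale or encode new data by hand.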

### Exercises
---

**1. Create a** `OneHotEncoder(sparse=False)` **object as** `encoder` **and call** `encoder.fit_transform()` **on the categorical columns of X**

Note: The `sparse=False` parameter makes the output a normal NumPy array. In scikit-learn 1.2 and later this parameter is named `sparse_output`.

In [64]:
encoder = OneHotEncoder(sparse=False)

print(X.head())
print(encoder.fit_transform(X[categorical]))
print(encoder.get_feature_names())  # replaced by get_feature_names_out() in newer scikit-learn

       cut color clarity
0    Ideal     E     SI2
1  Premium     E     SI1
2     Good     E     VS1
3  Premium     I     VS2
4     Good     J     SI2
[[0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]
['x0_Fair' 'x0_Good' 'x0_Ideal' 'x0_Premium' 'x0_Very Good' 'x1_D' 'x1_E'
 'x1_F' 'x1_G' 'x1_H' 'x1_I' 'x1_J' 'x2_I1' 'x2_IF' 'x2_SI1' 'x2_SI2'
 'x2_VS1' 'x2_VS2' 'x2_VVS1' 'x2_VVS2']
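
For a smaller view of the same idea, the sketch below encodes a made-up three-row DataFrame. It leaves the encoder at its default (sparse) output and converts with `.toarray()`, which is one way to get a dense array without relying on the `sparse` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data for illustration
toy = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal"],
                    "color": ["E", "J", "E"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)   # sparse matrix by default

dense = encoded.toarray()              # convert to a regular NumPy array
print(dense)
print(encoder.categories_)             # categories found in each column, sorted
```

Each row gets exactly one 1 per original column: the first two output columns correspond to `Good` and `Ideal`, the last two to `E` and `J`.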


**2. Create a model based on only the categorical features of X. Fit the model to the data, and find the mean absolute error and R2 score.**

Hint: Use a pipeline and OneHotEncoder()

In [56]:
categorical = ['cut', 'color', 'clarity']
X = df[categorical]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = make_pipeline(OneHotEncoder(), LinearRegression())

model.fit(X_train, y_train)
model.score(X_test, y_test)  # R2 score; only the cell's last expression is displayed

y_predicted = model.predict(X_test)
mean_absolute_error(y_test, y_predicted)


# print(model.named_steps.onehotencoder.get_feature_names())
# model.named_steps.linearregression.coef_

2857.1149942497245

**3. Create a model based on only the numeric features of X, which preprocesses data using StandardScaler(). Fit the model to the data, and find the mean absolute error and R2 score.**

In [69]:
numeric = ['x', 'y', 'z', 'carat', 'table', 'depth']
categorical = ['cut', 'color', 'clarity']

X = df[numeric]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = make_pipeline(StandardScaler(), LinearRegression())

model.fit(X_train, y_train)
model.score(X_test, y_test)

y_predicted = model.predict(X_test)
mean_absolute_error(y_test, y_predicted)


0.8589517896911343

**4. Use a ColumnTransformer to create a model with both** `OneHotEncoder()` **and** `StandardScaler()` **preprocessing for the appropriate columns.**

**Fit the model to the data, and find the mean absolute error and R2 score.**

In [73]:
numeric = ['x', 'y', 'z', 'carat', 'table', 'depth']
categorical = ['cut', 'color', 'clarity']



X = df.drop("price", axis='columns')
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric),
        ('cat', OneHotEncoder(), categorical)
    ])

model = make_pipeline(preprocessor, LinearRegression())

model.fit(X_train, y_train)
model.score(X_test, y_test)


In [78]:
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
                 (OneHotEncoder(), X.select_dtypes(['object']).columns),
                 (StandardScaler(), X.select_dtypes(['float64', 'int64']).columns)
               )
    
model = make_pipeline(preprocessor, LinearRegression())
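
The column lists passed to `make_column_transformer` come from `select_dtypes`. A minimal sketch of that step on made-up data follows; it uses `include=["number"]` and `exclude=["number"]`, a variation on the explicit dtype names above that covers every integer and float dtype at once.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "carat": [0.23, 0.21],
    "table": [55, 61],
    "cut": ["Ideal", "Premium"],
})

# Columns holding numbers (ints and floats) versus everything else
numeric_cols = toy.select_dtypes(include=["number"]).columns
categorical_cols = toy.select_dtypes(exclude=["number"]).columns

print(list(numeric_cols))
print(list(categorical_cols))
```

This avoids typing the column names by hand, at the cost of trusting that every non-numeric column really is categorical.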

In [80]:
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categorical_features=None,
                                                                categories=None,
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                n_values=None,
                                                                sparse=True),
                                                  Index(['cut', 'color', 'clarity'], dtype='object')),