# Python for Machine Learning

### *Session \#1*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then **ENTER** to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Preparing Data

### Warm Ups

*Type the given code into the cell below*

---

**Import pandas and read CSV**: 
```python
import pandas as pd
df = pd.read_csv("diamonds.csv")
```


**Isolate a column:** `y = df['carat']`

**Use subset of columns**
```python
columns = ['depth', 'table']
X = df[columns]
```

**Drop columns:** `df.drop(['carat'], axis=1)`
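To see the three selection idioms above side by side, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not taken from `diamonds.csv`.

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are invented for illustration).
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "depth": [61.5, 59.8, 62.4],
    "table": [55.0, 61.0, 58.0],
})

y_toy = toy["carat"]                   # one column name -> Series
X_toy = toy[["depth", "table"]]        # list of column names -> DataFrame
dropped = toy.drop(["carat"], axis=1)  # returns a new DataFrame

print(type(y_toy).__name__)   # Series
print(type(X_toy).__name__)   # DataFrame
print(list(dropped.columns))  # ['depth', 'table']
print(list(toy.columns))      # ['carat', 'depth', 'table'] (original unchanged)
```

Note that `drop` does not modify `toy` in place, so assign the result to a variable if you want to keep it.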

**Split data into train/test sets:**
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

*Note: The default split is 75% train, 25% test. Change the proportion with the* `test_size` *parameter.*
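A short sketch of how `test_size` changes the split, using a made-up 100-row frame so the proportions are easy to read. The names `toy_X` and `toy_y` are hypothetical, and `random_state=0` is added here only to make the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 rows of placeholder data so the split sizes are easy to check.
toy_X = pd.DataFrame({"depth": range(100), "table": range(100, 200)})
toy_y = pd.Series(range(100), name="carat")

# Default split: 75% train, 25% test.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(toy_X, toy_y)
print(len(Xa_train), len(Xa_test))  # 75 25

# 80% train, 20% test, with a fixed seed for a repeatable shuffle.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)
print(len(Xb_train), len(Xb_test))  # 80 20
```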

### Exercises
---

**1. Create the feature matrix** `X` **by forming a dataframe from the columns** `x, y, z,` **and** `carat`

**2. Create a feature matrix** `X` **again, this time by dropping the non-numeric columns** `cut, color, clarity` **and the target vector** `price` (unlike Exercise 1, this keeps every numeric column, including `depth` and `table`)

**3. Create the target vector** `y` **from the column** `price`

**4. Use** `train_test_split` **to divide your data into** `X_train`, `X_test`, `y_train`, `y_test`

## II. Scikit-Learn Basics

### Warm Ups

*Type the given code into the cell below*

---

**Choose model and hyperparameters:**
```python
from sklearn.linear_model import LinearRegression
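# fit_intercept is one hyperparameter you can set (True is the default).
# The older normalize=True argument was removed in scikit-learn 1.2.
model = LinearRegression(fit_intercept=True)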
```


**Fit model to data:** `model.fit(X_train, y_train)`

**Use model to predict:** `y_predicted = model.predict(X_test)`

**Find average error:** 
```python
from sklearn.metrics import mean_absolute_error
# scikit-learn's convention is (y_true, y_pred)
mean_absolute_error(y_test, y_predicted)
```

**Find R2 score:** `model.score(X_test, y_test)`
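The warm-up steps above can be chained into one short run. This is a sketch on synthetic data rather than the diamonds set: the arrays `X_demo` and `y_demo`, the coefficients 3.0 and -2.0, and the noise level are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y is a noisy linear function of two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

demo_model = LinearRegression()
demo_model.fit(Xd_train, yd_train)          # learn coefficients from training data
yd_predicted = demo_model.predict(Xd_test)  # predict on data the model has not seen

mae = mean_absolute_error(yd_test, yd_predicted)  # average absolute error
r2 = demo_model.score(Xd_test, yd_test)           # R2 on the test set

print(f"MAE: {mae:.3f}  R2: {r2:.3f}")
# model.score gives the same number as r2_score on the predictions.
print(np.isclose(r2, r2_score(yd_test, yd_predicted)))
```

A lower mean absolute error is better, and it is in the same units as the target. An R2 score close to 1 means the model explains most of the variation in the test targets.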

### Exercises
---

**1. Create a LinearRegression model**

**2. Fit the model using** `X_train` **and** `y_train`

**3. Make a** `y_predicted` **variable using** `model.predict()` **on** `X_test`

**4. Use** `y_predicted` **and** `y_test` **to find the mean absolute error of your model**

**5. Use** `X_test` **and** `y_test` **to find the R2 score of your model**

## III. One-Hot Encoding and Pipelines

### Warm Ups

*Type the given code into the cell below*

---

**Imports for One-Hot Encoding and Pipelines:**

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
```

**Add encoding to model using pipeline:**
```python
model = make_pipeline(OneHotEncoder(), LinearRegression())
```

**Add StandardScaler to model using pipeline:**
```python
model = make_pipeline(StandardScaler(), LinearRegression())
```
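As a quick illustration of what the scaling step does inside the pipeline, the sketch below applies `StandardScaler` to a tiny made-up column. The array `values` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature with three made-up values.
values = np.array([[1.0], [2.0], [3.0]])

scaled = StandardScaler().fit_transform(values)

# After scaling, the column has mean 0 and standard deviation 1.
print(np.isclose(scaled.mean(), 0.0))  # True
print(np.isclose(scaled.std(), 1.0))   # True
```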

**Split pipeline between two types of columns:**
```python
numeric = ['x', 'y', 'z', 'carat', 'table', 'depth']
categorical = ['cut', 'color', 'clarity']

column_transformer = make_column_transformer(
    (StandardScaler(), numeric),
    (OneHotEncoder(), categorical),
)

# Pass the transformer defined above as the first pipeline step
model = make_pipeline(column_transformer, LinearRegression())
```
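The same pattern can be tried end to end on a toy frame before running it on the full data. Everything in `toy` below (column values and prices) is invented for illustration, and the names `toy_transformer` and `toy_model` are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one numeric and one categorical feature.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2, 1.5],
    "cut": ["Ideal", "Good", "Ideal", "Premium", "Good", "Premium"],
    "price": [400, 900, 1500, 3000, 3500, 6000],
})
toy_features = toy[["carat", "cut"]]

toy_transformer = make_column_transformer(
    (StandardScaler(), ["carat"]),  # scale the numeric column
    (OneHotEncoder(), ["cut"]),     # encode the categorical column
)

# 1 scaled column + 3 one-hot columns (Good, Ideal, Premium) = 4 columns.
transformed = toy_transformer.fit_transform(toy_features)
print(transformed.shape)  # (6, 4)

# The pipeline applies the transformer, then fits the regression.
toy_model = make_pipeline(toy_transformer, LinearRegression())
toy_model.fit(toy_features, toy["price"])
toy_predictions = toy_model.predict(toy_features)
print(toy_predictions.shape)  # (6,)
```

Because the preprocessing lives inside the pipeline, `fit` and `predict` accept the raw DataFrame, with no manual encoding or scaling beforehand.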

### Exercises
---

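**1. Create a** `OneHotEncoder(sparse_output=False)` **object as** `encoder` **and call** `encoder.fit_transform()` **on the categorical columns of X**

Note: The `sparse_output=False` parameter makes the output a regular NumPy array instead of a sparse matrix. In scikit-learn versions before 1.2, this parameter was named `sparse`.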

**2. Create a model based on only the categorical features of X. Fit the model to the data, and find the mean absolute error and R2 score.**

Hint: Use a pipeline and OneHotEncoder()

**3. Create a model based on only the numeric features of X, which preprocesses data using StandardScaler(). Fit the model to the data, and find the mean absolute error and R2 score.**

**4. Use a ColumnTransformer to create a model with both** `OneHotEncoder()` **and** `StandardScaler()` **preprocessing for the appropriate columns.**

**Fit the model to the data, and find the mean absolute error and R2 score.**