# Python for Machine Learning

### *Session \#4*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Logistic Regression

### Warm Ups

*Type the given code into the cell below*

---

In [7]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from yellowbrick.classifier import ConfusionMatrix, ClassPredictionError, ROCAUC, PrecisionRecallCurve
from yellowbrick.target import ClassBalance

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_excel('titanic.xlsx').dropna()

**Split into data sets**: 
```python
X = df[['age']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create and fit classifier**: 
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```

**Use model to classify**: `model.predict(X_test)`

**Use model to get probabilities**: `model.predict_proba(X_test)`
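To see how the two calls relate, here is a minimal sketch on synthetic data (the ages and labels are made up; this is not the Titanic file, and `demo_model` is a hypothetical name). It checks that the second column of `predict_proba` is the logistic function applied to the fitted line `coef_ * x + intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Titanic data: one numeric feature, one binary label.
rng = np.random.default_rng(0)
ages = rng.uniform(1, 80, size=200).reshape(-1, 1)
# Assumption for illustration only: younger passengers survive more often.
p_true = 1 / (1 + np.exp(0.05 * (ages[:, 0] - 30)))
survived = (rng.uniform(size=200) < p_true).astype(int)

demo_model = LogisticRegression()
demo_model.fit(ages, survived)

# Columns of predict_proba follow demo_model.classes_, here [0, 1].
proba = demo_model.predict_proba(ages)
labels = demo_model.predict(ages)

# Rebuild P(survived = 1) by hand from the fitted slope and intercept.
linear = ages[:, 0] * demo_model.coef_[0, 0] + demo_model.intercept_[0]
manual = 1 / (1 + np.exp(-linear))
```

Here `manual` should match `proba[:, 1]`, and `predict` returns whichever class has the larger probability.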

### Exercises
---

**1. Copy/paste the slope from** `model.coef_` **and the intercept from** `model.intercept_`

**2. Plot the underlying linear model using** `plt.plot()`**.**

**Pass** `X_test` **as the x-axis and** `X_test*slope + intercept` **as the y-axis.**

**3. Use** `plt.scatter` **to plot** `X_test` **and** `y_test`, **and also to plot** `curve_x` **and** `curve_y`, **which show the curve of the logistic classifier**

In [None]:
curve_x = np.linspace(-100, 200, 100).reshape(-1, 1)
# Column 1 of predict_proba is P(survived = 1), which lines up with the 0/1 values in y_test
curve_y = model.predict_proba(curve_x)[:, 1]

**4. What is your own probability of survival? Because the model expects 2D input (one row per person, one column per feature), you'll need to wrap your age in two sets of brackets. So if your age is 50, you'll input** `[[50]]`

**5. Use one-hot encoding to add all the columns to your model. Call model.score() to see the accuracy of your model.**

Hint: Use `make_column_transformer` to separate categorical and numeric data
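As a rough illustration of the hint, the sketch below scales numeric columns and one-hot encodes a categorical one on a tiny made-up table. The column names and values are placeholders, not the Titanic schema, and `toy_model` is a hypothetical name.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up table; column names are placeholders for illustration.
toy = pd.DataFrame({
    "age":   [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":  [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1],
    "sex":   ["male", "female", "female", "female",
              "male", "male", "female", "female"],
    "label": [0, 1, 1, 1, 0, 0, 1, 1],
})
X_toy = toy[["age", "fare", "sex"]]
y_toy = toy["label"]

# Each tuple pairs a transformer with the columns it should handle.
columns = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
)

toy_model = make_pipeline(columns, LogisticRegression())
toy_model.fit(X_toy, y_toy)

# 2 scaled numeric columns + 2 one-hot columns for "sex".
encoded = columns.fit_transform(X_toy)
# Accuracy on the training rows only; use a held-out split for a fair estimate.
train_accuracy = toy_model.score(X_toy, y_toy)
```

Putting the transformer inside the pipeline means the same scaling and encoding are applied automatically whenever you call `predict` or `score`.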

## II. Naive Bayes

### Warm Ups

*Type the given code into the cell below*

---

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
from yellowbrick.classifier import ConfusionMatrix, ClassPredictionError

from sklearn.pipeline import make_pipeline

df = pd.read_csv("spam.csv", encoding = 'latin-1')

**Vectorize the words:** 
```python
preprocess = CountVectorizer(stop_words='english', max_df=8, ngram_range=(1,3))
```

**Set up pipeline and model:** 
```python
model = make_pipeline(preprocess, MultinomialNB())
model.fit(df['text'], df['category'])
```

**Use model on a sentence:** `model.predict(["You're our instant winner! Call now to claim your prize"])`
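To get a feel for what `ngram_range` and an integer `max_df` do, here is a small sketch on three made-up messages (not the spam dataset). An integer `max_df` is an absolute count: terms found in more than that many documents are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages for illustration.
docs = [
    "free prize call now",
    "call me now",
    "free lunch tomorrow",
]

# Unigrams and bigrams; max_df=1 drops every term that occurs in more than 1 document.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=1)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index; sorting the keys
# gives the terms in column order.
vocabulary = sorted(vectorizer.vocabulary_)
```

In this toy example "free", "call" and "now" each occur in two messages, so they are removed, while bigrams such as "free prize" occur in only one message and are kept.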

### Exercises
---

**1. Divide the spam dataset into** `X_train, X_test, y_train, y_test`

**2. Create a pipeline with a CountVectorizer and a MultinomialNB model**

**3. Fit the model to the spam dataset. What is the accuracy of the model?**

**4. What is the class balance between spam and non-spam?**

**5. Retrain your model using balanced data, utilizing** `RandomOverSampler()`
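For intuition about exercises 4 and 5, the sketch below counts classes and then balances them by repeating minority rows. It is a NumPy-only illustration of the idea behind random oversampling on made-up labels, not imblearn's implementation. Note that the warm-up cell imports `make_pipeline` from `sklearn.pipeline`, which replaces the earlier imblearn import; a pipeline containing `RandomOverSampler()` needs `make_pipeline` from `imblearn.pipeline`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced data: 90 "ham" messages and 10 "spam" messages.
labels = np.array(["ham"] * 90 + ["spam"] * 10)
texts = np.array([f"message {i}" for i in range(100)])

# Class balance: how many rows each class has.
classes, counts = np.unique(labels, return_counts=True)
target = counts.max()

# For every class below the majority size, draw extra row indices
# with replacement. Rows are repeated, not synthesized.
extra = [
    rng.choice(np.flatnonzero(labels == c), size=target - n, replace=True)
    for c, n in zip(classes, counts)
    if n < target
]
balanced_idx = np.concatenate([np.arange(len(labels)), *extra])

balanced_labels = labels[balanced_idx]
balanced_texts = texts[balanced_idx]
_, balanced_counts = np.unique(balanced_labels, return_counts=True)
```

Oversampling is normally applied to the training split only, so that the test set keeps the real class balance.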

## III. Multi-class Classification

### Warm Ups

*Type the given code into the cell below*

---

In [3]:
df = pd.read_csv("tweets.csv")

### Exercises
---

**1. Train a Naive Bayes classifier on all the Twitter data. What is the accuracy of the model?**

**2. Use a** `ClassPredictionError` **plot to determine which celebrities are commonly mistaken for each other**


**3. Plot a** `ConfusionMatrix` **of your model. Which celebrity is hardest to classify?**         

**4. Change the hyperparameters of your Naive Bayes model to improve performance**
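The `ConfusionMatrix` plot is built on the same table that `sklearn.metrics.confusion_matrix` returns. As a hedged sketch of how to read it, the example below uses made-up true and predicted labels (the author names are placeholders, not the tweets dataset) and picks the class with the lowest recall as the hardest one.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder class names and made-up predictions for 10 tweets.
names = ["author_a", "author_b", "author_c"]
y_true = ["author_a"] * 4 + ["author_b"] * 3 + ["author_c"] * 3
y_pred = ["author_a", "author_a", "author_a", "author_b",
          "author_b", "author_c", "author_c",
          "author_c", "author_c", "author_c"]

# Rows are true classes, columns are predicted classes, both in `names` order.
matrix = confusion_matrix(y_true, y_pred, labels=names)

# Recall per class: correct predictions divided by the row total.
recall = matrix.diagonal() / matrix.sum(axis=1)
hardest = names[int(np.argmin(recall))]
```

Large off-diagonal entries show which pairs of classes get mistaken for each other; here most of the errors are `author_b` tweets predicted as `author_c`.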

## IV. Twitter Bot

### Warm Ups

*Type the given code into the cell below*

---

**Set up Tweepy**

In [1]:
import tweepy
from random import choices

# Replace the placeholders with your own credentials; never share or commit real keys
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

**Send a tweet:** `api.update_status("Hello from Python!")`

**See names of classes:** `model.classes_`

**Make random choices:** 
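```python
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# When model is a pipeline, feature_count_ lives on its last step (the MultinomialNB)
words = choices(preprocess.get_feature_names_out(),
                weights=model[-1].feature_count_[1],
                k=4)
```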

**Join strings together:** `" ".join(words)`
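The weighted sampling and joining steps can be tried on their own, without a fitted model. In this sketch the vocabulary and counts are placeholders standing in for the vectorizer's feature names and the classifier's per-class word counts.

```python
from random import Random

rng = Random(0)  # seeded so the sketch is repeatable

# Placeholder vocabulary and per-word counts for one class.
vocabulary = ["prize", "winner", "call", "claim", "lunch"]
counts = [30, 25, 20, 20, 5]

# Draw 4 words with replacement; higher counts are picked more often.
words = rng.choices(vocabulary, weights=counts, k=4)
sentence = " ".join(words)
```

Because `choices` samples with replacement, the same word can appear more than once in a generated sentence.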