Dylan Hastings

## 1: Face Recognition, but not evil this time

Using the faces dataset in:

```
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
```

If you use the `faces.target` and `faces.target_names` attributes, you can build a facial recognition algorithm.

Use sklearn **gridsearch** (or an equivalent, like random search) to optimize the model for accuracy. Try both an SVM-based classifier and a logistic-regression-based classifier (with a feature pipeline of your choice) to get the best model. You should reach at least 80% accuracy.
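
As a rough illustration of how `target` and `target_names` fit together, the sketch below reports accuracy per person rather than as a single number. The helper name `per_person_report` and the toy labels are assumptions made for illustration; they are not part of the dataset or the assignment.

```python
from sklearn.metrics import classification_report


def per_person_report(y_true, y_pred, target_names):
    """Return precision/recall/F1 per person, labelled by name."""
    # target_names[i] is the name for integer label i, as in faces.target_names.
    return classification_report(y_true, y_pred, target_names=list(target_names))


# Tiny made-up example; the names and labels are placeholders, not LFW data.
names = ["Person A", "Person B"]
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(per_person_report(y_true, y_pred, names))
```

With a fitted model, the same helper could be called as `per_person_report(y_test, y_pred, faces.target_names)`.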

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [35]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

In [36]:
X = faces.data
y = faces.target

In [37]:
from sklearn.svm import SVC # "Support vector classifier"
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.pipeline import Pipeline

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

In [39]:
pca = PCA()

In [40]:
svc = SVC()

In [41]:
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}

In [42]:
model = Pipeline([('pca', PCA(n_components=150, whiten=True)),
                ('svc', SVC())])

In [43]:
g = GridSearchCV(model,param_grid)

In [44]:
g.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('pca',
                                        PCA(n_components=150, whiten=True)),
                                       ('svc', SVC())]),
             param_grid={'svc__C': [1, 5, 10, 50],
                         'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]})

In [45]:
y_pred = g.predict(X_test)

In [46]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.827893175074184
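
The task also asks for a logistic regression classifier. A minimal sketch of that variant follows, assuming the same PCA front end as the SVM pipeline above. The function name `make_logreg_search`, the added `StandardScaler` step, and the `C` grid are illustrative assumptions, not tuned settings, and no accuracy is claimed for it.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_logreg_search(n_components=150, cv=5):
    """Grid search over a scaler -> PCA -> logistic regression pipeline."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_components, whiten=True, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ])
    # The C values below are an assumed starting grid, not tuned for LFW.
    param_grid = {"logreg__C": [0.01, 0.1, 1, 10]}
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
```

It would be used like the SVM search: `search = make_logreg_search()`, then `search.fit(X_train, y_train)` and `search.score(X_test, y_test)`.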

## 2: Bag of Words, Bag of Popcorn

By this point, you are ready for the [Bag of Words, Bag of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition. 

Use NLP feature pre-processing (using sklearn, Gensim, spaCy, or Hugging Face) to build the best classifier you can. Use a feature pipeline and grid search for your final model.

A successful project should get 90% or more on a **holdout** dataset you kept for yourself.
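
One way to combine the three requirements (feature pipeline, grid search, holdout) is to put the vectorizer inside the pipeline and split off the holdout before anything is fitted. The sketch below assumes a TF-IDF plus logistic regression model. The function name `fit_review_classifier` and the parameter grid are hypothetical choices for illustration, and no particular score is implied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def fit_review_classifier(texts, labels, cv=5, random_state=42):
    """Tune a TF-IDF + logistic regression pipeline; score it on a holdout."""
    # Split first so the holdout reviews never influence vocabulary or IDF weights.
    text_train, text_hold, y_train, y_hold = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=random_state)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(sublinear_tf=True, stop_words="english")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Assumed starting grid; the values are illustrative, not tuned.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [1, 10],
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(text_train, y_train)
    return search, search.score(text_hold, y_hold)
```

With the Kaggle data loaded into a DataFrame, this could be called as `search, holdout_acc = fit_review_classifier(train['review'], train['sentiment'])`.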

In [41]:
# This model is based on the following reference:
# https://www.kaggle.com/lcukerd/rating-predictor

In [6]:
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

In [7]:
# The tsv files were downloaded from Kaggle and put in a 'data' subfolder that I created
train = pd.read_csv('data/labeledTrainData.tsv', header=0, delimiter='\t')
test = pd.read_csv('data/testData.tsv', header=0, delimiter='\t')

In [8]:
reviews = train['review']
sentiments = train['sentiment']

In [9]:
reviewsT = test['review']

In [10]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='ascii',
    analyzer='word',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
# Stack both review sets; `+` on two Series would join the strings row by row
word_vectorizer.fit(pd.concat([reviews, reviewsT]))

TfidfVectorizer(max_features=10000, stop_words='english', strip_accents='ascii',
                sublinear_tf=True)

In [11]:
X = word_vectorizer.transform(reviews)
X.shape

(25000, 10000)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, sentiments,test_size=0.25)

In [29]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()

In [25]:
param_grid = {'forest__n_estimators': [64, 100, 128]}

In [26]:
model = Pipeline([('forest', RandomForestClassifier())])

In [27]:
g = GridSearchCV(model,param_grid)

In [28]:
g.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('forest', RandomForestClassifier())]),
             param_grid={'forest__n_estimators': [64, 100, 128]})

In [30]:
y_pred = g.predict(X_test)

In [33]:
accuracy_score(y_test, y_pred)

0.84672