# Exercise
This exercise aims to convey important skills and concepts for applying data science in a production environment.
- Thinking about the data
 - what is it like and what could I do with it?
 - what is the business objective?
- Generating value from ML:
 - establishing a baseline
 - improving as required
 

## The challenge that you have been set:
You are part of HypeVentures, a fast-growing social media startup that provides chat and discussion-space technology to other startups to improve customer engagement with the content their users post. Management has hired you to sort out their marketing messaging, as the last marketing guy quit, and they intend to hire a number of topic experts to look after the different forums.
 

- What do people talk about?
 1. Can we cluster the conversation topics?
 2. Can we label some of them by hand?
 3. Can we use those labels to label the rest?
- Can we classify some new incoming data?
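
The questions above can be sketched end to end on a toy corpus. This is a minimal illustration rather than the exercise solution: the six sentences, the two hand-assigned topic names, and the choice of `TfidfVectorizer` with `KMeans` are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): three posts per topic
docs = [
    "graphics card renders the image",
    "the image renders on the graphics card",
    "new graphics card for image rendering",
    "the doctor treats the patient",
    "patient sees the doctor about a cold",
    "doctor gives the patient medicine",
]

# 1. Cluster the posts
X_toy = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
labels = kmeans.labels_

# 2. Label one post per cluster by hand (hypothetical topic names)
hand_labels = {labels[0]: "hardware", labels[3]: "health"}

# 3. Use the hand labels to name every post in the same cluster
for doc, cluster in zip(docs, labels):
    print(f"{hand_labels[cluster]:>8} <= {doc}")
```

On real forum data the clusters are rarely this clean, which is why the hand-labeling step matters.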

## Lessons learnt
- Humble pie
 - unlabeled data is hard (unsupervised learning)
 - human labeling is extremely valuable
 - Always check your work
- Curse of dimensionality 

## Next steps
- Get the data labeled, for example with [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/pricing/)
- Can we train a classifier?

In [None]:
# Imports we'll need
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from exercise_utils import plot_confusion_matrix, get_bank_data


In [None]:
# Sklearn features
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

Data ingress
Load 'data_comp.graphics_sci.med_categories_train.csv' into a pandas DataFrame and inspect it.
Load 'data_comp.graphics_sci.med_categories_test.csv' into a pandas DataFrame and inspect it.


In [None]:
df2 = pd.read_csv('data_comp.graphics_sci.med_categories_train.csv').dropna()
df2_test = pd.read_csv('data_comp.graphics_sci.med_categories_test.csv').dropna()
df2.head()

Vectorize the text data

1. Either using a counting vectorizer
2. or a hashing one

Using the default English stop words:
stop words are very common words such as "and", "to" and "I". A complete list for a common use case: https://gist.github.com/sebleier/554280
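
As a small sketch of the two options and of what stop-word removal does, the snippet below vectorizes two made-up sentences. The sentences and the `n_features=2**10` setting are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["The cat and the dog went to the park", "A dog and a cat"]

# Counting vectorizer: learns a vocabulary, one column per distinct token
plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)
print(sorted(plain.vocabulary_))     # includes words such as "and" and "the"
print(sorted(filtered.vocabulary_))  # stop words have been dropped

# Hashing vectorizer: stores no vocabulary, tokens are hashed into a fixed number of columns
hashed = HashingVectorizer(n_features=2**10, stop_words="english").transform(docs)
print(hashed.shape)
```

The counting vectorizer lets you map columns back to words; the hashing one uses a fixed amount of memory but cannot do that.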

In [None]:
# Count vectorization again (with stopwords this time)
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(df2.data)

In [None]:
# Transform the data as before (SVD and normalization)
svd = TruncatedSVD(n_components=40)
X_reduced = svd.fit_transform(X_train_counts)

norm = Normalizer()
X = norm.fit_transform(X_reduced)
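
To see what the two transforms do to the shape and scale of the data, here is a minimal sketch on four made-up documents. It uses 2 components instead of 40 only because the toy vocabulary is tiny.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

docs = [
    "graphics card image render",
    "image render pixel graphics",
    "doctor patient medicine cold",
    "patient cold fever doctor",
]

counts = CountVectorizer().fit_transform(docs)          # sparse, one column per word
toy_svd = TruncatedSVD(n_components=2, random_state=0)
reduced = toy_svd.fit_transform(counts)                 # dense, one column per component
unit = Normalizer().fit_transform(reduced)              # each row rescaled to L2 length 1

print(counts.shape, "->", reduced.shape)
row_norms = np.linalg.norm(unit, axis=1)
print(row_norms)
```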

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
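# Shift the features so that all values are non-negative by adding the (absolute) minimum.
# Models such as MultinomialNB require this; SGDClassifier does not,
# but any new data must get the same shift before prediction.
X += abs(X.min())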

In [None]:
# Let's fit a linear model (SGDClassifier with hinge loss)
# Essentially a linear SVM trained with stochastic gradient descent
# more info here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
clfb = SGDClassifier(loss='hinge', tol=None, random_state=0, max_iter=5).fit(X, df2.subject)
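
A minimal sketch of the same kind of model on made-up 2-D points may help to see what it learns: a weight vector and an intercept whose score decides the class. The data and the `max_iter`/`tol` values here are assumptions for the example, not the settings used in the exercise.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Two well-separated 2-D clusters (made-up data)
X_toy = np.array([[-2.0, -2.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

toy_svm = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
toy_svm.fit(X_toy, y_toy)

# decision_function returns the raw score w.x + b for each row;
# a positive score means class 1, a negative score class 0
scores = toy_svm.decision_function(X_toy)
toy_pred = toy_svm.predict(X_toy)
print(scores)
print(toy_pred)
```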

In [None]:
# inspect the model we have specified
clfb

In [None]:
# get the targets 
targets = df2.groupby('subject')['category'].first().sort_index().tolist()
targets
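
The following sketch shows what this expression produces on a toy frame. It assumes, as the later indexing with predicted values suggests, that `subject` holds integer class codes starting at 0 and `category` the matching name; the rows themselves are made up.

```python
import pandas as pd

# Toy frame with the same column names as the exercise data (values made up)
toy = pd.DataFrame({
    "subject": [1, 0, 1, 0],
    "category": ["sci.med", "comp.graphics", "sci.med", "comp.graphics"],
})

# One category name per integer subject, ordered by subject,
# so that toy_targets[i] is the name of class i
toy_targets = toy.groupby("subject")["category"].first().sort_index().tolist()
print(toy_targets)
```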

In [None]:
# Make up some sentences and see what it predicts
docs_new = [
    "My new Intel CPU is great",
    "I have a really bad cold",
]

In [None]:
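# Process the sentences the same way as the training data and predict
X_new_counts = count_vect.transform(docs_new)
X_new_reduced = norm.transform(svd.transform(X_new_counts))
# apply the same shift that was added to the training features
X_new_reduced += abs(norm.transform(X_reduced).min())
predicted = clfb.predict(X_new_reduced)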

In [None]:
# Print predictions in a pretty way
for doc, category in zip(docs_new, predicted):
    print(f'{doc} => {targets[category]}')

In [None]:
# Let's wrap the classifier in a sklearn Pipeline, so we don't need to redo all the steps
# Let's also try and use a different model, say a Naive Bayes approach
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [None]:
# Fit the pipeline
text_clf.fit(df2.data, df2.subject)  

In [None]:
# Score the pipeline
text_clf.score(df2_test.data, df2_test.subject)
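
As a self-contained illustration of why the pipeline is convenient, the sketch below trains the same three steps on four made-up documents and predicts directly from raw strings. The documents and labels are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = [
    "graphics card image render",
    "image render pixel graphics",
    "doctor patient medicine cold",
    "patient cold fever doctor",
]
train_labels = [0, 0, 1, 1]  # 0 = graphics, 1 = medicine (made-up labels)

toy_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() runs every step in order; predict() reuses the fitted steps on raw text
toy_clf.fit(train_docs, train_labels)
toy_pred = toy_clf.predict(["pixel image", "fever medicine"])
print(toy_pred)
```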

In [None]:
# Now do the same for our SGD classifier
text_clf = Pipeline([
   ('vect', CountVectorizer(stop_words='english')),
   ('tfidf', TfidfTransformer()),
   ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                         max_iter=50, tol=None)),
])

In [None]:
text_clf.fit(df2.data, df2.subject)  

In [None]:
text_clf.score(df2_test.data, df2_test.subject) 

In [None]:
predicted = text_clf.predict(df2_test.data)

In [None]:
# Create a classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(df2_test.subject, predicted,
    target_names=targets))

In [None]:
# Using plot_confusion_matrix from the helper module, plot the matrix to inspect it
cm = confusion_matrix(df2_test.subject, predicted)
plot_confusion_matrix(df2_test.subject, predicted, targets, normalize=True)
plt.show()
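
The helper's implementation is not shown here, so as a hedged sketch of what a normalized confusion matrix usually means, the snippet below uses scikit-learn's own `confusion_matrix` with `normalize="true"`, which divides each row by the number of true samples in that class. Whether the helper normalizes in exactly this way is an assumption.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_hat = [0, 1, 1, 1]

# Rows are true classes, columns are predicted classes
raw = confusion_matrix(y_true, y_hat)
# normalize="true" turns each row into fractions of that true class
by_row = confusion_matrix(y_true, y_hat, normalize="true")
print(raw)
print(by_row)
```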

## Bonus material
Try the same approach on more difficult data, or on more and different data.
Load one of the following:

- data_all_categories_train and test
- data_alt.atheism_soc.religion.christian_categories_train and test
- data_comp.graphics_sci.med_alt.atheism_soc.religion.christian_categories_train and test

In [None]:
file = lambda x: f'data_all_categories_{x}.csv'
#file = lambda x: f'data_alt.atheism_soc.religion.christian_categories_{x}.csv'
#file = lambda x: f'data_comp.graphics_sci.med_alt.atheism_soc.religion.christian_categories_{x}.csv'
df = pd.read_csv(file('train')).dropna()
df_test = pd.read_csv(file('test')).dropna()
df.head()

In [None]:
X_train = df.data
y_train = df.subject

targets = df.groupby('subject')['category'].first().sort_index().tolist()
X_test = df_test.data
y_test = df_test.subject

In [None]:
text_clf.fit(X_train, y_train)  
y_pred = text_clf.predict(X_test)

print(classification_report(y_test, y_pred,
    target_names=targets))
plot_confusion_matrix(y_test, y_pred,targets,
                          normalize=True)
plt.show()

## More bonus material

Real complaints about banks in America

Load the data from bank_data.csv or use get_bank_data. Only use the top 10_000 entries, unless you have a very powerful PC or lots of time.

In [None]:
df = pd.read_csv("bank_data.csv").head(10_000)  # keep only the top 10_000 entries to limit run time
df.head()

In [None]:
# Let's grab the targets again
targets = df.groupby('subject')['category'].first().sort_index().tolist()
targets

In [None]:
# Use the train_test_split from sklearn to make a train and a test data set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.data, df.subject)
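
A minimal sketch of the split on made-up data follows. The `test_size`, `stratify`, and `random_state` arguments are optional extras assumed for the example: they make the split reproducible and keep the class proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

toy_texts = [f"complaint {i}" for i in range(8)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.25,       # a quarter of the rows go to the test set
    stratify=toy_labels,  # keep the class balance in both parts
    random_state=0,       # same split on every run
)
print(len(X_tr), len(X_te))
print(sorted(y_te))
```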

In [None]:
# Fit the model and plot the confusion matrix
text_clf.fit(X_train, y_train)  
y_pred = text_clf.predict(X_test)
plot_confusion_matrix(y_test, y_pred,targets,
                          normalize=False)
plt.show()

In [None]:
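# Let's add some better stop words to see if we can improve things
# (CountVectorizer lower-cases text before filtering, so the extra entries are lower case,
# and recent scikit-learn versions expect a list rather than a frozenset)
from sklearn.feature_extraction import text
stop_words = list(text.ENGLISH_STOP_WORDS.union(['xx', 'xxx', 'xxxx']))
text_clf.set_params(vect__stop_words=stop_words)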
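
One detail worth checking when adding custom stop words: with its default settings `CountVectorizer` lower-cases each document before it filters stop words, so entries are matched in lower case. The sketch below shows this on two made-up complaint sentences.

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = ["XXXX charged my account twice", "The bank XXXX refused a refund"]

lower = list(ENGLISH_STOP_WORDS.union(["xxxx"]))
upper = list(ENGLISH_STOP_WORDS.union(["XXXX"]))

vocab_lower = CountVectorizer(stop_words=lower).fit(docs).vocabulary_
with warnings.catch_warnings():
    # scikit-learn may warn that upper-case stop words do not match its preprocessing
    warnings.simplefilter("ignore")
    vocab_upper = CountVectorizer(stop_words=upper).fit(docs).vocabulary_

print("xxxx" in vocab_lower)  # the lower-case entry removes the redaction token
print("xxxx" in vocab_upper)  # the upper-case entry never matches the lower-cased text
```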

In [None]:
# again we fit and evaluate
text_clf.fit(X_train, y_train)  
y_pred = text_clf.predict(X_test)

plot_confusion_matrix(y_test, y_pred,targets,
                          normalize=False)
plt.show()

In [None]:
print(classification_report(y_test, y_pred, target_names=targets))

In [None]:
# let's look at more metrics
from sklearn.metrics import accuracy_score, f1_score
print(f'Testing accuracy {accuracy_score(y_test, y_pred):.2%}')
print(f"Testing F1 score: {f1_score(y_test, y_pred, average='weighted'):.2%}" )
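
To make the difference between these metrics concrete, here is a small worked sketch on four made-up labels. Class 0 has F1 = 0.8 and class 1 has F1 = 2/3; the macro average weights them equally, while the weighted average weights each class by its number of true samples.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1]
y_hat = [0, 0, 1, 1]

acc = accuracy_score(y_true, y_hat)                      # 3 of 4 correct
macro = f1_score(y_true, y_hat, average="macro")         # (0.8 + 2/3) / 2
weighted = f1_score(y_true, y_hat, average="weighted")   # (3 * 0.8 + 1 * 2/3) / 4

print(f"accuracy {acc:.2%}, macro F1 {macro:.2%}, weighted F1 {weighted:.2%}")
```

With imbalanced classes, as is likely for the complaint categories, the weighted score can hide poor performance on rare classes, so it is worth looking at both.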

### Bonus bonus material
For a doc2vec / gensim approach to the same bank data, check out
https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb