# Machine Learning sample class!

This walkthrough is (mostly) based on a series of pieces I wrote up for [investigate.ai](http://investigate.ai/), including:

- [Comparing sentiment analysis tools](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/)
- [Designing your own sentiment analysis tool](https://investigate.ai/investigating-sentiment-analysis/designing-your-own-sentiment-analysis-tool/)
- [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors.](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/) (Washington Post)
- [Predicting reports of bullying, racism, and unwanted sexual behavior from app store reviews](https://investigate.ai/wapo-app-reviews/predict-reviews/)
- [How Quartz used AI to sort through the Luanda Leaks](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks) (if it looks blank, scroll down)
- [Comparing documents across languages with Universal Sentence Encoding and Tensorflow](https://investigate.ai/text-analysis/comparing-documents-in-different-languages/)
- [Uncovering surveillance planes with BuzzFeed](https://investigate.ai/buzzfeed-spy-planes/)
- [GitHub Copilot](https://github.com/features/copilot)

Let's have some fun! You might also enjoy the vaguely similar [Sentiment to Spyplanes](https://github.com/jsoma/sentiment-to-spyplanes).

# How to use this notebook

This is a **Jupyter notebook**, a way of sharing and annotating code. Here's how you use it:

1. Click near the top.
2. Hold down shift.
3. Press enter. It runs some code.
4. Go back to step 2

Congratulations, you're a programmer!

# Installation fun

We'll need to install a few tools before we move on. It doesn't matter what they are, really, but I'll tell you anyway:

* **NLTK:** text and sentiment analysis (old workhorse)
* **TextBlob:** text and sentiment analysis (a bit more convenient than NLTK)
* **spaCy:** natural language processing
* **jieba:** tokenizer for Chinese
* **scikit-learn:** classical machine learning
* **Hugging Face transformers:** advanced machine learning tools

In [1]:
!pip install --quiet eli5 matplotlib pandas nltk textblob spacy jieba scikit-learn transformers datasets evaluate


...and now a little additional setup for our old friend NLTK....

In [1]:
import nltk

nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('movie_reviews', quiet=True)

True

Download a couple datasets for later...

In [2]:
!wget --quiet -O reviews-marked.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/reviews-marked.csv?raw=true"
!wget --quiet -O sentiment140-subset.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/sentiment140-subset.csv?raw=true"

# Our noble goal

Our goal today is for you to understand this sentence:
    
> "train a binary classification model on the tokenized 'Review' column using a huggingface transformer (uncased dilbert)"

No guarantees, but we'll try!

# To start: sentiment analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the technique of seeing whether a piece of content is positive or negative. Let's start by feeding some sentences into a sentiment analysis tool to see what happens!

In [3]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()

sia.polarity_scores("I love this kitten")

{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.6369}

In [4]:
text = "I hate this keyboard"
sia.polarity_scores(text)

{'neg': 0.649, 'neu': 0.351, 'pos': 0.0, 'compound': -0.5719}

In [5]:
text = "Your feedback is appreciated :)"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.743}

In [6]:
text = "Your feedback is appreciated 🤮"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 0.476, 'pos': 0.524, 'compound': 0.5106}

In [7]:
text = "That restaurant was great, but I'm not sure if I'll go there again"
sia.polarity_scores(text)

{'neg': 0.153, 'neu': 0.688, 'pos': 0.159, 'compound': 0.0276}

In [8]:
text = "This article was pure garbage"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
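
To turn a `compound` score into a label, a common convention (the one suggested in VADER's own documentation) is to call 0.05 and above positive, -0.05 and below negative, and anything in between neutral. A minimal sketch, where `label_compound` is a made-up helper name and not part of NLTK:

```python
def label_compound(compound, threshold=0.05):
    """Turn a VADER compound score (-1 to 1) into a rough label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the cells above
examples = {
    "I love this kitten": 0.6369,
    "I hate this keyboard": -0.5719,
    "That restaurant was great, but I'm not sure if I'll go there again": 0.0276,
    "This article was pure garbage": 0.0,
}

for text, compound in examples.items():
    print(label_compound(compound), "|", text)
```

By this rule "This article was pure garbage" comes out neutral, which most human readers would disagree with.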

---

## Vocabulary check: binary classification

> "train a `binary classification model` on the tokenized 'Review' column using a huggingface transformer (uncased dilbert)"

**Binary classification** is putting things into two categories. Positive/negative, interesting/boring, etc. A model is the tool that does the classification for us.
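
As a toy illustration, here's about the simplest "model" imaginable: a hand-written rule that sorts text into exactly two categories. The word list and the `classify` function are invented for this example; real models learn their rules from data instead of having them typed in.

```python
# A made-up list of "negative" words, just for illustration
NEGATIVE_WORDS = {"hate", "garbage", "broken", "irritating"}

def classify(text):
    """Put a sentence into one of two categories: 'negative' or 'positive'."""
    words = text.lower().split()
    if any(word in NEGATIVE_WORDS for word in words):
        return "negative"
    return "positive"

print(classify("I hate this keyboard"))
print(classify("I love this kitten"))
```

There is no third option here: anything the rule doesn't catch as negative lands in the positive pile, which is what makes it binary.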

---

## TextBlob

TextBlob is another library for performing text analysis, and it has **two ways** of performing sentiment analysis.

### Option A

In [9]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

In [10]:
blob = TextBlob("I love this kitten")
blob.sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

In [11]:
blob = TextBlob("I hate this keyboard")
blob.sentiment

Sentiment(polarity=-0.8, subjectivity=0.9)

In [12]:
blob = TextBlob("This article was pure garbage")
blob.sentiment

Sentiment(polarity=0.21428571428571427, subjectivity=0.5)

### Option B

In [13]:
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

blob = blobber("This article was pure garbage")
blob.sentiment

Sentiment(classification='neg', p_pos=0.3898306696279278, p_neg=0.610169330372073)

# Comparing all of our sentiment analysis tools

In [14]:
import pandas as pd

sentences = pd.DataFrame({'content': [
    "I love this kitten",
    "I hate keyboard",
    "I appreciate the feedback :)",
    "I appreciate the feedback 🤮",
    "This article was garbage",
    "This article was pure garbage",
    "That restaurant was great, but I'm not sure if I'll go there again",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})

sentences

Unnamed: 0,content
0,I love this kitten
1,I hate keyboard
2,I appreciate the feedback :)
3,I appreciate the feedback 🤮
4,This article was garbage
5,This article was pure garbage
6,"That restaurant was great, but I'm not sure if..."
7,I'm not sure how I feel about toast
8,Did you see the baseball game yesterday?
9,The package was delivered late and the content...


In [15]:
def get_scores(content):
    blob = TextBlob(content)
    nb_blob = blobber(content)
    sia_scores = sia.polarity_scores(content)
    
    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
    })

scores = sentences.content.apply(get_scores)
scores.style.background_gradient(cmap='RdYlGn', axis=None, low=0.4, high=0.4)

Unnamed: 0,content,textblob,textblob_bayes,nltk
0,I love this kitten,0.5,-0.087933,0.6369
1,I hate keyboard,-0.8,-0.206089,-0.5719
2,I appreciate the feedback :),0.5,-0.299545,0.6908
3,I appreciate the feedback 🤮,0.0,-0.299545,0.4019
4,This article was garbage,0.0,-0.519103,0.0
5,This article was pure garbage,0.214286,-0.220339,0.0
6,"That restaurant was great, but I'm not sure if I'll go there again",0.275,0.186505,0.0276
7,I'm not sure how I feel about toast,-0.25,0.394659,-0.2411
8,Did you see the baseball game yesterday?,-0.4,0.61305,0.0
9,The package was delivered late and the contents were broken,-0.35,-0.57427,-0.4767
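
The `textblob_bayes` column deserves a quick note: the Naive Bayes analyzer returns two probabilities rather than a single score, so `get_scores` subtracts one from the other to land on the same -1 to 1 scale as the other tools. A worked check using the Option B output from earlier (this assumes the two probabilities sum to 1, which they do in that output):

```python
# Probabilities copied from the Option B output for
# "This article was pure garbage"
p_pos = 0.3898306696279278
p_neg = 0.610169330372073

score = p_pos - p_neg
print(round(score, 6))

# Because p_pos + p_neg is 1, this is the same as rescaling p_pos
# from the 0-to-1 range onto the -1-to-1 range
print(round(2 * p_pos - 1, 6))
```

Both lines should agree with the `textblob_bayes` value shown for that sentence in the table above.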


## What's it used for?

* The Upshot's Trump + State of the Union: https://www.nytimes.com/interactive/2017/02/28/upshot/trump-sounds-different-tone-in-first-address-to-congress.html
* WaPo's App Stores: https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/
* AJC's Doctors and Sex Abuse: http://doctors.ajc.com/
* BuzzFeed's Spies in the Skies: https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes
* Trump on Twitter: https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-presidency.html

---

## Vocabulary check: training

> "`train a binary classification model` on the tokenized 'Review' column using a huggingface transformer (uncased dilbert)"

**Training** is the process of teaching your model. For a binary classifier, this involves showing the model examples from both categories, e.g., "here are a lot of positive tweets, here are a lot of negative tweets."
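
Here's a deliberately tiny sketch of what "showing the model examples" can mean. The example sentences, the `train` and `predict` functions, and the scoring rule are all invented for illustration. Real classifiers use much more careful math, but the shape is the same: labeled examples go in, a reusable set of word weights comes out.

```python
from collections import Counter

def train(examples):
    """Learn a score for each word: +1 per positive example, -1 per negative."""
    weights = Counter()
    for text, label in examples:
        for word in set(text.lower().split()):
            weights[word] += 1 if label == 1 else -1
    return weights

def predict(weights, text):
    """Add up the learned word scores: 1 if the total is above zero, else 0."""
    total = sum(weights[word] for word in text.lower().split())
    return 1 if total > 0 else 0

# Hypothetical training data: 1 = positive, 0 = negative
examples = [
    ("what a lovely day", 1),
    ("lovely dinner with friends", 1),
    ("stuck in awful traffic", 0),
    ("awful day at work", 0),
]

weights = train(examples)
print(predict(weights, "lovely weather"))
print(predict(weights, "awful weather"))
```

Notice that "day" shows up once on each side, so it ends up with a weight of zero: the toy model has "learned" that the word tells it nothing.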

---

# Building our sentiment analysis tools

We'll start by reading in a list of tweets that are tagged as either positive or negative.

In [16]:
import pandas as pd
pd.options.display.max_colwidth = None
pd.options.display.max_columns = 100

df = pd.read_csv("sentiment140-subset.csv", nrows=1000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


## Tokenizing and vectorizing

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()



Unnamed: 0,00,09,10,10am,10pm,11,12,14,17,20,2day,2nd,30,able,about,absolutely,account,ace,ache,actually,africa,after,afternoon,again,ago,agree,ahead,ahh,ahhh,aint,air,alabama,all,allowed,ally,almost,already,alright,also,always,am,amazing,america,amount,amp,an,and,another,answer,any,...,while,white,who,whole,why,wife,will,win,wine,wish,wishes,wishing,with,without,wives,wohoo,woke,won,wonder,wondering,words,work,working,works,world,would,wow,write,wrong,www,xd,ya,yay,yea,yeah,yeahh,year,yelling,yerp,yes,yesterday,yet,yo,york,you,your,yours,youtube,yrs,yup
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Tokenizing non-English languages

Languages that like to combine multiple words into one – like German – are a little tougher than the "separate words by spaces" rule we can use in English. [CJK languages](https://en.wikipedia.org/wiki/CJK_characters) (Chinese, Japanese and Korean) are even more difficult: Chinese and Japanese typically don't put spaces between words, and all three require special attention and rules.
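
A quick way to see the problem, using only Python's built-in `str.split`. The English sentence is my rough translation, and the Chinese one is the same sentence tokenized in the next section:

```python
english = "Cuihua bought a light blue fish"
chinese = "翠花买了浅蓝色的鱼"

# Splitting on whitespace works reasonably well for English...
print(english.split())

# ...but the Chinese sentence has no spaces, so it comes back as one "word"
print(chinese.split())
```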

## Tokenizing Chinese

In [18]:
import jieba

jieba.lcut('翠花买了浅蓝色的鱼')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/jieba.cache
Loading model cost 0.372 seconds.
Prefix dict has been built successfully.


['翠花', '买', '了', '浅蓝色', '的', '鱼']

In [19]:
texts_zh = [
  '翠花买了浅蓝色的鱼',
  '翠花买了浅蓝橙色的鱼',
  '猫在商店吃了一条鱼',
  '翠花去了商店。翠花买了一只虫子。翠花看到一条鱼',
  '翠花是鱼'  
]

def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

vectorizer = CountVectorizer(tokenizer=tokenize_zh, stop_words=['。', '，'])
matrix = vectorizer.fit_transform(texts_zh)

words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.index = texts_zh
words_df



Unnamed: 0,一只,一条,买,了,去,吃,商店,在,是,橙色,浅蓝,浅蓝色,猫,的,看到,翠花,虫子,鱼
翠花买了浅蓝色的鱼,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,1,0,1
翠花买了浅蓝橙色的鱼,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,1,0,1
猫在商店吃了一条鱼,0,1,0,1,0,1,1,1,0,0,0,0,1,0,0,0,0,1
翠花去了商店。翠花买了一只虫子。翠花看到一条鱼,1,1,1,2,1,0,1,0,0,0,0,0,0,0,1,3,1,1
翠花是鱼,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1


---

## Vocabulary check: tokenization

> "`train a binary classification model on the tokenized 'Review' column` using a huggingface transformer (uncased dilbert)"

**Tokenizing** is separating a piece of text into individual words, while **vectorizing** is the process of counting those words.
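
To make those two steps concrete, here they are done by hand with only the standard library. The regular expression is meant to mirror `CountVectorizer`'s default behavior (lowercase everything, keep runs of two or more letters or digits), and the `tokenize` and `vectorize` helper names are made up for this sketch:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def vectorize(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Two tweets from the dataset preview earlier
tweets = [
    "Sick today coding from the couch.",
    "Wii fit says I've lost 10 pounds since last time",
]

# The vocabulary is every distinct word across all texts, sorted
vocabulary = sorted({word for tweet in tweets for word in tokenize(tweet)})

print(vocabulary)
for tweet in tweets:
    print(vectorize(tweet, vocabulary))
```

Each tweet becomes a row of numbers as long as the vocabulary, mostly zeros, which is the same shape as the big `words_df` tables in this notebook.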

---

# Building a binary classifier

We'll be reproducing part of [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/?arc404=true), from the Washington Post.

In [20]:
import pandas as pd
pd.set_option("display.max_colwidth", 300)

# Read in our data, then drop ones without a text
# review and get rid of a few unwanted columns
df = pd.read_csv("reviews-marked.csv")
df = df.dropna(subset=['Review'])
df = df.drop(columns=['Country', 'Date', 'Version'])

known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()

known.head(10)

Unnamed: 0,Rating,Review,source,racism,bullying,sexual
2,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,holla,0.0,0.0,0.0
6,1,This is good but most of my messages never show up. This is very crapy and needs to be fixed,skout,0.0,0.0,0.0
8,1,I was really enjoying this app. This brought me out of the box. I’m an extremely shy person and this gave me somewhere to talk to nice people. I just got kicked of bc I’m 16 not “18” and I think that this change it kind of stupid bc yeah it’s for protection but like someone else said all you hav...,holla,0.0,0.0,0.0
13,1,It won’t lemme go live or anything like I think you fixed it for everyone but me and now it says I’m banned for no reason I didn’t even do anything,holla,0.0,0.0,0.0
15,1,No real ppl all fake or no reply,skout,0.0,0.0,0.0
17,2,Can’t join live and can’t scroll through profile and tap on them. Please fix ASAP. Everyone’s having this problem,holla,0.0,0.0,0.0
19,2,Descent,skout,0.0,0.0,0.0
22,1,Can't even get the app to open...there's no way the app can be any good if they can't even get it to open up without issues,skout,0.0,0.0,0.0
23,1,Y,skout,0.0,0.0,0.0
24,2,Haven't met a single person on this app yet...,skout,0.0,0.0,0.0


## Tokenization (splitting) and vectorization (counting)

To be honest: this time we're secretly using a *slightly fancier method of counting.*

"Count" vectorization just counts words, while "TF-IDF" vectorization makes words that show up all the time - like `the` or `and` -  less important. Just because two tweets both have "the" and "and" in common doesn't mean they're similar at all!

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(known.Review)

# Build a dataframe of words, purely out of curiosity
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(3)



Unnamed: 0,000,10,100,13,14,15,16,17,18,19,1st,20,200,2000,215,22,24,25,250,2p,30,309,39,48,50,600,6500,77,7832,80,85,862,95,99,able,about,abruptly,absolutely,accept,access,account,accounts,accuse,act,acting,action,active,activo,acts,actual,...,work,worked,working,works,world,worried,worse,worst,worth,worthless,worthwhile,would,wouldn,wow,write,wrong,ya,yeah,year,years,yellow,yet,york,you,young,younger,your,yourself,youtube,youtuber,yubo,zero,اكثر,الاشخاص,البرنامج,القريبين,تحسين,حولنا,زيادة,قبل,للدردشة,للرمنسية,للعب,مخصص,مكان,من,نطاق,والصداقة,وضع,ومكان
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.115461,0.0,0.10512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14673,0.0,0.0,0.0,0.0,0.0,0.115461,0.0,0.0,0.0,0.0,0.0,0.141882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
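
Where do decimals like `0.115461` come from? Here's a rough sketch of the idea behind TF-IDF, by hand. The three short "reviews" are invented, and the formula is the smoothed inverse document frequency that scikit-learn documents as `TfidfVectorizer`'s default. The real vectorizer also multiplies by each word's count in the document and normalizes every row, which this sketch skips.

```python
import math

# Three invented reviews, already lowercase
docs = [
    "the app is great",
    "the app keeps crashing",
    "the guys ask for nudes",
]

def idf(word, docs):
    """Smoothed inverse document frequency: rarer words get bigger weights."""
    n_docs = len(docs)
    n_containing = sum(1 for doc in docs if word in doc.split())
    return math.log((1 + n_docs) / (1 + n_containing)) + 1

for word in ["the", "app", "nudes"]:
    print(word, round(idf(word, docs), 3))
```

"the" appears in every document, so it gets the smallest possible weight, while a word that appears in only one document gets the largest.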


## Training

In [22]:
from sklearn.svm import LinearSVC

# Features (TF-IDF scores for each word) + labels (1 = flagged as sexual, 0 = not)
X = words_df
y = known.sexual

# Train a LinearSVC classifier
svc = LinearSVC()
svc.fit(X, y)

## Using our classifier

### Making predictions

In [23]:
# Vectorize the unlabeled reviews with the same TF-IDF vectorizer we trained on
vectors = vectorizer.transform(unknown.Review)

# SVC predictions
unknown['pred_svc'] = svc.predict(vectors)
unknown['svc_score'] = svc.decision_function(vectors)



In [24]:
unknown.pred_svc.value_counts()

0.0    55712
1.0       13
Name: pred_svc, dtype: int64

### Seeing the results

In [25]:
unknown.sort_values(by='svc_score', ascending=False).head(25)

Unnamed: 0,Rating,Review,source,racism,bullying,sexual,pred_svc,svc_score
19428,2,All the guys on here ever ask for is nudes like I don't want to send my nudes to you,skout,,,,1.0,0.298642
53538,3,I like this app but there is so many horny guys and they are all 30 and asking for nudes,chat-for-strangers,,,,1.0,0.203618
18920,2,Buncha horn dogs (all guys want is sexy pics) and all girls do is show cleavage and their behind),skout,,,,1.0,0.123907
33958,5,I just want nudes,skout,,,,1.0,0.089558
20616,1,"Almost all the guys on the app ask girls for nudes and if you don't send them it they'll literally get upset and unfriend you on either snapchat or yellow itself, the people on there is shallow. 🤧",holla,,,,1.0,0.088913
39875,1,The only thing you’re going to get on the site is fake news there is asking me to go on to another site to pay to watch their nudes,skout,,,,1.0,0.080045
11002,1,To many perverts and all they ask for is nudes🙄,chat-for-strangers,,,,1.0,0.067204
55612,4,Like a small thing that's pink/ blue to show whether your m or f. That would be nice. And to all the guys out there... Put your dick back in your pants. I'm a guy but I don't creep on girls. There a thing called porn.,chat-for-strangers,,,,1.0,0.061739
22327,1,"The app is old men on there, guys harass you, they treat women on there like we want sex and they say they’ll pay you for sex smh. This app needs to be shut down a lot of creepy old guys and some creepy young guys. They don’t read your profile they just harass you over and over again. The women ...",skout,,,,1.0,0.048448
21071,1,I just want to say that all these guys downloading this or reviewing just to get girls to send nudes are asking to be trolled.,chat-for-strangers,,,,1.0,0.035922


## BONUS! Explaining our classifier

In [28]:
import eli5

eli5.show_weights(svc, vec=vectorizer, top=(6, 6))



Weight?,Feature
+1.255,nudes
+0.903,guys
+0.856,men
+0.809,thing
+0.787,jacking
+0.631,without
… 374 more positive …,… 374 more positive …
… 1120 more negative …,… 1120 more negative …
-0.321,people
-0.324,have
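
Those weights are essentially the whole model: a linear classifier scores a review by multiplying each word's TF-IDF value by its weight, adding everything up, and adding a constant called the intercept. A score above zero becomes a prediction of 1. Here's a sketch of that arithmetic, using four weights from the table above but with an invented intercept and invented TF-IDF values (the real ones live in `svc.intercept_` and the `vectors` matrix):

```python
# Weights copied from the eli5 table above
weights = {"nudes": 1.255, "guys": 0.903, "people": -0.321, "have": -0.324}

# Invented for illustration: the real value comes from the trained model
intercept = -1.0

def score(tfidf_values):
    """Weighted sum of a review's TF-IDF values, plus the intercept."""
    # Words missing from our tiny weights dict count as 0 here;
    # in the real model every vocabulary word has its own weight
    return intercept + sum(
        weights.get(word, 0.0) * value for word, value in tfidf_values.items()
    )

# Invented TF-IDF values for two hypothetical reviews
review_a = {"guys": 0.5, "nudes": 0.6}
review_b = {"people": 0.4, "have": 0.3}

for review in (review_a, review_b):
    s = score(review)
    print(round(s, 4), "-> predicted", 1 if s > 0 else 0)
```

This is the relationship between the `svc_score` and `pred_svc` columns earlier: every review predicted as 1 has a score above zero.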


---

## Vocabulary check: We didn't learn anything new!!!

> "`train a binary classification model on the tokenized 'Review' column` using a huggingface transformer (uncased dilbert)"

We still don't know anything about the second part!

---


# Hello Hugging Face

This is... probably too much for a 30-45 minute talk. But [here's a GPT-2 example](https://huggingface.co/gpt2?text=Once+upon+a+time%2C) just for fun!