# Project 1
**Name**: Adriane Mikko Amorado<br>
**Course Name**: Solving Business Problems with NLP<br>
**Instructor**: Juber Rahman

For this project, I used the dataset that was shared with the learners in the Slack channel via Google Drive. Instead of the safe/adult labels described in the assignment, I tried to predict whether each review was written by a kid or a teen.


## Import tools

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Part A
Prepare a dataset for online games classification. Scrape game reviews from Common Sense Media: https://www.commonsensemedia.org/game-reviews. You may use the ParseHub software for the scraping, following this tutorial: https://www.parsehub.com/blog/web-scraper-tutorial/

Manually label each row (game review) as safe or adult.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
%cd "/content/drive/MyDrive/NLP for BP/week 1"

/content/drive/.shortcut-targets-by-id/11XCV4AIVZEkert2XR_xH0ZhxeSjnuqDA/NLP for BP/week 1


In [5]:
# read target class data
df = pd.read_csv('game_reviews.csv')

In [6]:
df

Unnamed: 0,selection1_name,selection1_selection2,selection1_selection3
0,"Teen, 17 years old",age 7+,Kids dont listent to the parents saying ''ThIs...
1,"Kid, 11 years old",age 2+,I have been playing this game for many years a...
2,"Kid, 12 years old",age 7+,The game is great with no true inappropriate t...
3,"Teen, 13 years old",age 5+,Are you sure you got common sense. I would giv...
4,"Kid, 10 years old",age 5+,IDK WHAT TO SAY BUT DIS IS DA BEST GAME EVA
...,...,...,...
556,"Teen, 14 years old",age 7+,"Wow. Just wow. Common Sense, THIS GAME IS IN A..."
557,"Teen, 13 years old",age 4+,A quite popular gem on the app store/google pl...
558,"Kid, 12 years old",age 2+,"It's a abhorrent try by Mojang, though a excel..."
559,"Teen, 15 years old",age 6+,Minecraft: Pocket Edition is an app which you ...


In [7]:
df['label'] = df['selection1_name'].str.split(',', n=1).str[0]
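
As a quick check of the split logic above, here is a minimal plain-Python sketch applied to two reviewer strings copied from the table output. The helper name `extract_label` is hypothetical and only mirrors what `str.split(',', n=1).str[0]` does per row.

```python
def extract_label(reviewer):
    """Return the text before the first comma, e.g. 'Teen' or 'Kid'."""
    return reviewer.split(",", 1)[0]

# Two values taken from the selection1_name column shown above
reviewers = ["Teen, 17 years old", "Kid, 11 years old"]
labels = [extract_label(r) for r in reviewers]
print(labels)  # ['Teen', 'Kid']
```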

In [8]:
df.dropna(inplace=True)
df

Unnamed: 0,selection1_name,selection1_selection2,selection1_selection3,label
0,"Teen, 17 years old",age 7+,Kids dont listent to the parents saying ''ThIs...,Teen
1,"Kid, 11 years old",age 2+,I have been playing this game for many years a...,Kid
2,"Kid, 12 years old",age 7+,The game is great with no true inappropriate t...,Kid
3,"Teen, 13 years old",age 5+,Are you sure you got common sense. I would giv...,Teen
4,"Kid, 10 years old",age 5+,IDK WHAT TO SAY BUT DIS IS DA BEST GAME EVA,Kid
...,...,...,...,...
556,"Teen, 14 years old",age 7+,"Wow. Just wow. Common Sense, THIS GAME IS IN A...",Teen
557,"Teen, 13 years old",age 4+,A quite popular gem on the app store/google pl...,Teen
558,"Kid, 12 years old",age 2+,"It's a abhorrent try by Mojang, though a excel...",Kid
559,"Teen, 15 years old",age 6+,Minecraft: Pocket Edition is an app which you ...,Teen


## Part B
Train an NLP model for classification. Split your data into train and test sets. Preprocess the reviews with tokenization and stop word removal. Prepare two sets of embeddings using Bag-of-Words and TF-IDF. Train a machine learning model to classify the reviews into safe and adult. Evaluate your model on the test set and compare the performance of each embedding. Upload your notebook to the course GitHub repo.
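
To make the difference between the two representations concrete, here is a small plain-Python sketch on a hypothetical three-document corpus (the documents and helper names are invented for illustration). The TF-IDF weights use the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the scikit-learn defaults; treat it as an approximation rather than a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus (not from the dataset)
docs = [
    ["great", "game", "fun"],
    ["game", "violent", "game"],
    ["fun", "fun", "puzzle"],
]

# Shared vocabulary, sorted so that vector positions are stable
vocab = sorted({tok for doc in docs for tok in doc})

def bow_vector(doc):
    """Bag-of-Words: raw term counts per vocabulary entry."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf_vector(doc):
    """TF-IDF: counts scaled by smoothed inverse document frequency, then L2-normalized."""
    n_docs = len(docs)
    counts = Counter(doc)
    weights = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)       # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (assumed default)
        weights.append(counts[term] * idf)
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

print(vocab)
print(bow_vector(docs[1]))
print([round(w, 3) for w in tfidf_vector(docs[0])])
```

In the first document, "great" appears in only one document while "game" appears in two, so TF-IDF gives "great" the larger weight even though both have a count of 1 in the Bag-of-Words vector.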

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df['selection1_selection3'], df['label'],
    random_state=42
)

In [10]:
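def tokenizer(doc):
    """Tokenize a review, keep content words, and lemmatize them by POS."""
    # Match tokens of two or more characters; hyphens are allowed inside a token
    regexp_tokenizer = RegexpTokenizer(r'(?u)\b(\w(?:\w|\-)+)\b')
    tokens = regexp_tokenizer.tokenize(doc)

    # Keep adjectives (J), nouns (N), verbs (V), and adverbs (R), and map each
    # Penn Treebank tag to the WordNet POS code ('a', 'n', 'v', 'r')
    postags = [
        (token.lower(), 'a' if pos[0] == 'J' else pos[0].lower())
        for token, pos in nltk.pos_tag(tokens)
        if pos[0] in 'JNVR']

    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(*t) for t in postags]

    return lemmas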
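
The regular expression in the tokenizer above can be inspected on its own. The sketch below uses the standard `re` module with the same pattern, on the assumption that `RegexpTokenizer` applies it in a `findall`-like way; the sample sentence and the helper name `split_tokens` are hypothetical. It shows only the raw token split, before POS filtering and lemmatization.

```python
import re

# Same pattern as the RegexpTokenizer in the tokenizer function above
TOKEN_PATTERN = re.compile(r'(?u)\b(\w(?:\w|\-)+)\b')

def split_tokens(text):
    """Return the raw tokens matched by the pattern, before POS filtering."""
    return TOKEN_PATTERN.findall(text)

# Hypothetical sample sentence, not taken from the dataset
sample = "IDK what a well-known game is"
tokens = split_tokens(sample)
print(tokens)  # ['IDK', 'what', 'well-known', 'game', 'is']
```

Single-character words such as "a" are dropped because the pattern requires at least two characters, and hyphenated words stay together as one token.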

### Bag-of-Words

In [11]:
bow_pipeline = Pipeline(
    [
        ("bow",
         CountVectorizer(tokenizer=tokenizer, stop_words=stopwords.words('english'))
        ),
        ("clf", GradientBoostingClassifier()),
    ]
)

In [12]:
bow_pipeline.fit(X_train, y_train)

Pipeline(steps=[('bow',
                 CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...],
                                 tokenizer=<function tokenizer at 0x7fb8bd387050>)),
                ('clf', GradientBoostingClassifier())])

In [13]:
bow_pipeline.score(X_train, y_train), bow_pipeline.score(X_test, y_test)

(0.8730964467005076, 0.5454545454545454)

### TF-IDF

In [14]:
tfidf_pipeline = Pipeline(
    [
        ("tfidf",
         TfidfVectorizer(tokenizer=tokenizer, stop_words=stopwords.words('english'))
        ),
        ("clf", GradientBoostingClassifier()),
    ]
)

In [15]:
tfidf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...],
                                 tokenizer=<function tokenizer at 0x7fb8bd387050>)),
                ('clf', GradientBoostingClassifier())])

In [16]:
tfidf_pipeline.score(X_train, y_train), tfidf_pipeline.score(X_test, y_test)

(0.9441624365482234, 0.5833333333333334)
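
Both pipelines score much higher on the training set than on the test set, so the test accuracies are easier to interpret next to a majority-class baseline. The following is a minimal sketch of such a baseline in plain Python; the function name and the toy label lists are hypothetical and do not reproduce the notebook's data.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test row."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only (not the notebook's actual split)
toy_train = ["Teen", "Teen", "Kid", "Teen", "Kid"]
toy_test = ["Teen", "Kid", "Teen", "Kid"]

label, accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(label, accuracy)  # Teen 0.5
```

In the notebook, the same function could be called as `majority_baseline_accuracy(y_train.tolist(), y_test.tolist())`. A classifier whose test accuracy is close to this baseline has learned little beyond the class distribution.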

## Reference
https://omdena.com/blog/internet-safety-children/