
# NLP Analysis of Language Learning

*Madelyn Scandlen and Shivali Pandya*

This project performs text classification on a Reddit corpus and a corpus of essays by Spanish learners in order to see which language learners are the most successful.

There are two main features of the project. 

The first task is to perform supervised classification of Spanish learners into different levels of proficiency. 

The second task is to classify Spanish learners into motivation profiles and evaluate the relationship between the learners' motivation and their proficiency over time.

## Set Up

### Importing Data from Google Cloud

This step copies the data files (gzipped JSON and CSV) from their location in Google Cloud Storage into this Colab notebook.

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
!curl https://sdk.cloud.google.com | bash

Downloading Google Cloud SDK install script: https://dl.google.com/dl/cloudsdk/channels/rapid/install_google_cloud_sdk.bash
######################################################################## 100.0%
Running install script from: /tmp/tmp.YWR53CbOCd/install_google_cloud_sdk.bash
which curl
curl -# -f https://dl.google.com/dl/cloudsdk/channels/rapid/google-cloud-sdk.tar.gz
######################################################################## 100.0%

mkdir -p /root
"/root/google-cloud-sdk" already exists and may contain out of date files.
Remove /root/google-cloud-sdk or select a new installation directory, then run again.


In [3]:
!gcloud init
# input 1, 1, optimal-bivouac-330014

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'
core:
  account: mscandlen12@gmail.com
  project: optimal-bivouac-330014

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you would like to use to perform operations for this 
configuration:
 [1] mscandlen12@gmail.com
 [2] Log in with a new account
Please enter your numeric choice:  1

You are logged in as: [mscandlen12@gmail.com].

Pick c

In [4]:
!gsutil cp gs://language_learning_subreddit/2019_SUBREDDITS=learnspanish,spanish.gz gs://language_learning_subreddit/famous.F17.csv gs://language_learning_subreddit/reddit_tagged.csv .

Copying gs://language_learning_subreddit/2019_SUBREDDITS=learnspanish,spanish.gz...
Copying gs://language_learning_subreddit/famous.F17.csv...
Copying gs://language_learning_subreddit/reddit_tagged.csv...
- [3 files][ 15.0 MiB/ 15.0 MiB]                                                
Operation completed over 3 objects/15.0 MiB.                                     


### Uploading Data to a DataFrame

In [5]:
import pandas as pd
import os
import csv
import json
import gzip

import re
import string

#### Essay Data

In this section, we will build a corpus of Spanish words and sentences used by non-native Spanish writers (also called L2 speakers/writers). We will use the UC Davis Corpus of Written Spanish, L2 and Heritage Speakers (COWSL2H). This corpus contains essays written by Spanish students at UC Davis in response to the following prompts: "famous person", "your perfect vacation plan", "a special person in your life", and "a terrible story".

In [6]:
df_essay = pd.read_csv('./famous.F17.csv', index_col=0)
df_essay

Unnamed: 0,id,prompt,quarter,course,age,gender,l1 language,other l1 language(s),language(s) used at home,language(s) studied,listening comprehension,reading comprehension,speaking ability,writing ability,study abroad,essay,a personal annotator1,a personal annotator2,gender-number annotator1,gender-number annotator2,corrected
0,146362,famous,F17,SPA 2,19,Female,English,,,,3.0,3.0,1.0,1.0,No,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...
1,104622,famous,F17,SPA 3,20,Female,English,Not Applicable,No,Not Applicable,2.0,3.0,3.0,3.0,No,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de televisión que es muy di...
2,169693,famous,F17,SPA 24,18 as of April 2017,Female,English,,,,3.0,3.0,2.0,3.0,No,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles sobre una persona famosa qu...
3,179355,famous,F17,SPA 1,20,Female,Other,Japanese,Japanese,English more than 10 years,2.0,3.0,1.0,2.0,No,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar []{a}<az:do:an> una chica fa...,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar una chica famosa en Japón. S...,Voy a presentar a una chica famosa en Japón. S...
4,148244,famous,F17,SPA 3,19,Male,Mandarin,,I speak mandarin at home,English 12 years,3.0,4.0,4.0,5.0,No,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,140109,famous,F17,SPA 3,19,Male,English,,no,no,3.0,4.0,3.0,4.0,Yes,Una persona Famoso: Nathan Fielder es una pers...,Una persona Famoso: Nathan Fielder es una pers...,Una persona Famoso: Nathan Fielder es una pers...,Una persona [Famoso]{famosa}<ga:fm:adj:an>: Na...,Una persona [Famoso]{famosa}<ga:fm:adj:an>: Na...,Una persona famosa: Nathan Fielder es una pers...
171,185606,famous,F17,SPA 23,23,Female,English,,no,,2.0,3.0,2.0,3.0,No,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa muy asombrosa es Helen Kell...
172,156764,famous,F17,SPA 24,19,Female,English,,,,3.0,3.0,2.0,2.0,No,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa. Selena Gomez es ...
173,172630,famous,F17,SPA 1,18,Female,English,none...,Spanish.,none...,2.0,3.0,2.0,2.0,No,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que me encanta es Sabrina C...


#### Reddit Data

The data was obtained by scraping two subreddits (r/Spanish and r/learnspanish) for the year 2019 with [redditsearch.io](https://www.redditsearch.io/) and creating one JSON object per post. Here, "posts" covers both submissions to the subreddit and comments. Each JSON object includes information such as the author's flair (author_flair_text), the score (number of upvotes), and the text content of the post.

To classify users, we will reorganize the JSONs to contain all posts for a user, retaining the score and the comments by other users. 

In [7]:
subreddit_list = []

with gzip.open('./2019_SUBREDDITS=learnspanish,spanish.gz') as f:
  for obj in f:
    post = json.loads(obj)
    subreddit_list.append(post)

In [8]:
df_reddit = pd.DataFrame(subreddit_list)
df_reddit

Unnamed: 0,author,author_created_utc,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,body,can_gild,can_mod_post,collapsed,collapsed_reason,controversiality,created_utc,distinguished,edited,gilded,gildings,id,is_submitter,link_id,no_follow,parent_id,permalink,removal_reason,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,author_cakeday,quarantined,locked,all_awardings,total_awards_received,steward_reports,awarders,associated_award,collapsed_because_crowd_control,author_premium
0,gloix,1.314202e+09,,,[],,,,text,t2_5q1bq,False,You can say that when you've already ordered a...,True,False,False,,0,1546300964,,False,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",eczb5ii,False,t3_aaz2g9,True,t1_ecygmis,/r/Spanish/comments/aaz2g9/when_ordering_food_...,,1550712660,1,True,False,Spanish,t5_2qtt1,r/Spanish,public,,,,,,,,,,
1,garbagecoder,1.434302e+09,,second,"[{'e': 'text', 't': 'C1'}]",,C1,dark,richtext,t2_o3ucv,False,Thank you. There are some language where the l...,True,False,False,,0,1546301446,,False,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",eczbpj8,True,t3_ab9dr8,True,t1_ecz24yd,/r/Spanish/comments/ab9dr8/data_on_spanish_for...,,1550712907,2,True,False,Spanish,t5_2qtt1,r/Spanish,public,,,,,,,,,,
2,gatosol,1.477917e+09,#46d160,native,"[{'e': 'text', 't': '🇮🇨 Canarias (África) 🐱 Na...",ed67a04a-9a87-11e2-9ee1-12313b06caaf,🇮🇨 Canarias (África) 🐱 Native Spanish,light,richtext,t2_12hxof,False,Fast? \n\nhttp://www.youtube.com/watch?v=-W2NP...,True,False,False,,0,1546301848,,False,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",eczc6fx,False,t3_abcgjr,False,t3_abcgjr,/r/Spanish/comments/abcgjr/fast_speaking_youtu...,,1550713115,3,True,False,Spanish,t5_2qtt1,r/Spanish,public,,,,,,,,,,
3,gloix,1.314202e+09,,,[],,,,text,t2_5q1bq,False,"Chile? I have never heard someone say ""me pone...",True,False,False,,0,1546301907,,False,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",eczc8r2,False,t3_aaz2g9,True,t1_ecwk6pu,/r/Spanish/comments/aaz2g9/when_ordering_food_...,,1550713145,1,True,False,Spanish,t5_2qtt1,r/Spanish,public,,,,,,,,,,
4,brog88,1.499983e+09,,,[],,,,text,t2_6z3h9m2,False,¡Feliz Año Nuevo! También he estado viendo las...,True,False,False,,0,1546302339,,False,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",eczcqr0,False,t3_abcugt,False,t3_abcugt,/r/Spanish/comments/abcugt/feliz_año_nuevo_201...,,1550713397,9,True,False,Spanish,t5_2qtt1,r/Spanish,public,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79886,Goatlessly,1.532500e+09,,,[],,,,text,t2_1ulehyr1,False,Te lo resumo,True,False,False,,0,1569887499,,False,0,{},f22emdg,False,t3_dbff5h,True,t3_dbff5h,/r/learnspanish/comments/dbff5h/can_someone_re...,,1578011733,2,True,False,learnspanish,t5_2rd6d,r/learnspanish,public,,False,False,[],0.0,[],[],,,False
79887,stvbeev,1.513914e+09,,,[],,,,text,t2_4iyofhk,False,"It’s usually presented as a dichotomy, but eac...",True,False,False,,0,1569887598,,False,0,{},f22es6l,False,t3_dbkdc0,True,t3_dbkdc0,/r/Spanish/comments/dbkdc0/what_are_some_diffe...,,1578011813,7,True,False,Spanish,t5_2qtt1,r/Spanish,public,,False,False,[],0.0,[],[],,,False
79888,Rumope,1.397216e+09,,,[],,,,text,t2_g2tva,False,"Some sentences like ""A comer"" or ""A la playa"" ...",True,False,False,,0,1569887809,,False,0,{},f22f51w,False,t3_db5g7n,True,t1_f227n41,/r/Spanish/comments/db5g7n/shortcut_phrases_li...,,1578011979,2,True,False,Spanish,t5_2qtt1,r/Spanish,public,,False,False,[],0.0,[],[],,,False
79889,KingsElite,1.343486e+09,#e87500,,"[{'e': 'text', 't': 'Perfecting it'}]",1b663862-23b7-11e4-9a5d-12313b0e94e7,Perfecting it,light,richtext,t2_8hd71,False,I studied in Xela in 2017 and felt perfectly s...,True,False,False,,0,1569887898,,False,0,{},f22fac8,False,t3_dbdhx0,False,t3_dbdhx0,/r/learnspanish/comments/dbdhx0/spanish_school...,,1578012045,5,True,False,learnspanish,t5_2rd6d,r/learnspanish,public,,False,False,[],0.0,[],[],,,False


In [9]:
df_tagged = pd.read_csv('./reddit_tagged.csv', index_col=0)

## Task 1: Classifying Users by Proficiency Level




### Preprocessing the Data

In [10]:
df_essay

Unnamed: 0,id,prompt,quarter,course,age,gender,l1 language,other l1 language(s),language(s) used at home,language(s) studied,listening comprehension,reading comprehension,speaking ability,writing ability,study abroad,essay,a personal annotator1,a personal annotator2,gender-number annotator1,gender-number annotator2,corrected
0,146362,famous,F17,SPA 2,19,Female,English,,,,3.0,3.0,1.0,1.0,No,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...,Una persona famosa que admiro es Lauren Jaureg...
1,104622,famous,F17,SPA 3,20,Female,English,Not Applicable,No,Not Applicable,2.0,3.0,3.0,3.0,No,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de television que es muy di...,Yo veo un programa de televisión que es muy di...
2,169693,famous,F17,SPA 24,18 as of April 2017,Female,English,,,,3.0,3.0,2.0,3.0,No,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles de una persona famosa quien...,Antes de contarles sobre una persona famosa qu...
3,179355,famous,F17,SPA 1,20,Female,Other,Japanese,Japanese,English more than 10 years,2.0,3.0,1.0,2.0,No,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar []{a}<az:do:an> una chica fa...,Voy a prensentar una chica famosa en Japón. S...,Voy a prensentar una chica famosa en Japón. S...,Voy a presentar a una chica famosa en Japón. S...
4,148244,famous,F17,SPA 3,19,Male,Mandarin,,I speak mandarin at home,English 12 years,3.0,4.0,4.0,5.0,No,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...,Mi cantante favorita es Taylor Swift. Me gusta...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,140109,famous,F17,SPA 3,19,Male,English,,no,no,3.0,4.0,3.0,4.0,Yes,Una persona Famoso: Nathan Fielder es una pers...,Una persona Famoso: Nathan Fielder es una pers...,Una persona Famoso: Nathan Fielder es una pers...,Una persona [Famoso]{famosa}<ga:fm:adj:an>: Na...,Una persona [Famoso]{famosa}<ga:fm:adj:an>: Na...,Una persona famosa: Nathan Fielder es una pers...
171,185606,famous,F17,SPA 23,23,Female,English,,no,,2.0,3.0,2.0,3.0,No,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa quien yo pienso es muy asom...,Una persona famosa muy asombrosa es Helen Kell...
172,156764,famous,F17,SPA 24,19,Female,English,,,,3.0,3.0,2.0,2.0,No,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa\n\nSelena Gomez e...,Selena Gomez: Persona famosa. Selena Gomez es ...
173,172630,famous,F17,SPA 1,18,Female,English,none...,Spanish.,none...,2.0,3.0,2.0,2.0,No,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que yo encanta es Sabrina C...,Una famosa persona que me encanta es Sabrina C...


We start by taking the mean of each learner's four self-rated ability scores and rounding it to an integer. This gives us ordered, categorical labels for proficiency classification, {1,2,3,4,5}, with 5 being the most proficient and 1 the least proficient.

In [11]:
# Work on explicit copies so that df_essay stays untouched and pandas
# does not raise SettingWithCopyWarning.
ability_cols = ['listening comprehension', 'reading comprehension', 'speaking ability', 'writing ability']
df_e = df_essay.copy()
df_e['score'] = df_e[ability_cols].mean(axis=1)
df_e = df_e.dropna(axis=0, subset=['score']).copy()
df_e['score'] = df_e['score'].round(0).astype(int)
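
As a small, self-contained illustration of this labeling rule (not the notebook's pandas code), the sketch below averages the four self-rated abilities and rounds the result. It assumes the ratings are plain floats and uses Python's built-in `round`, which, like the NumPy rounding behind `Series.round`, rounds exact halves to the nearest even integer. The helper name `proficiency_label` is hypothetical.

```python
from statistics import mean


def proficiency_label(listening, reading, speaking, writing):
    """Average the four self-ratings and round to an integer label.

    Exact halves are rounded to the nearest even integer (2.5 -> 2, 3.5 -> 4).
    """
    return round(mean([listening, reading, speaking, writing]))


print(proficiency_label(2.0, 3.0, 2.0, 3.0))  # mean 2.5
print(proficiency_label(3.0, 4.0, 3.0, 4.0))  # mean 3.5
print(proficiency_label(2.0, 3.0, 3.0, 3.0))  # mean 2.75
```

This is why a learner with ratings 2, 3, 2, 3 (row 171, mean 2.5) receives label 2, while ratings 3, 4, 3, 4 (row 170, mean 3.5) receive label 4.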


In [12]:
df_e = df_e[['id', 'course', 'essay', 'score']]
df_e

Unnamed: 0,id,course,essay,score
0,146362,SPA 2,Una persona famosa que admiro es Lauren Jaureg...,2
1,104622,SPA 3,Yo veo un programa de television que es muy di...,3
2,169693,SPA 24,Antes de contarles de una persona famosa quien...,3
3,179355,SPA 1,Voy a prensentar una chica famosa en Japón. S...,2
4,148244,SPA 3,Mi cantante favorita es Taylor Swift. Me gusta...,4
...,...,...,...,...
170,140109,SPA 3,Una persona Famoso: Nathan Fielder es una pers...,4
171,185606,SPA 23,Una persona famosa quien yo pienso es muy asom...,2
172,156764,SPA 24,Selena Gomez: Persona famosa\n\nSelena Gomez e...,2
173,172630,SPA 1,Una famosa persona que yo encanta es Sabrina C...,2


We are also extracting the Spanish level course number using Regex. We are only extracting the first digit to have three courses {1,2,3}.

In [13]:
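# regex=True is required in recent pandas versions, where str.replace is literal by default
df_e['course'] = df_e['course'].str.replace(r"(SPA\s)(\d)([0-9]*)", r"\2", regex=True)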
df_e

Unnamed: 0,id,course,essay,score
0,146362,2,Una persona famosa que admiro es Lauren Jaureg...,2
1,104622,3,Yo veo un programa de television que es muy di...,3
2,169693,2,Antes de contarles de una persona famosa quien...,3
3,179355,1,Voy a prensentar una chica famosa en Japón. S...,2
4,148244,3,Mi cantante favorita es Taylor Swift. Me gusta...,4
...,...,...,...,...
170,140109,3,Una persona Famoso: Nathan Fielder es una pers...,4
171,185606,2,Una persona famosa quien yo pienso es muy asom...,2
172,156764,2,Selena Gomez: Persona famosa\n\nSelena Gomez e...,2
173,172630,1,Una famosa persona que yo encanta es Sabrina C...,2
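
To make the substitution concrete, here is a minimal sketch of the same idea with the standard `re` module. The pattern is a slightly simplified version of the one above, and the helper name `course_level` is an assumption made for illustration.

```python
import re

# "SPA", one whitespace character, the first digit (captured), then any remaining digits.
COURSE_RE = re.compile(r"SPA\s(\d)\d*")


def course_level(course):
    """Return only the first digit of a course code such as 'SPA 24'."""
    return COURSE_RE.sub(r"\1", course)


for code in ["SPA 1", "SPA 2", "SPA 24", "SPA 3"]:
    print(code, "->", course_level(code))
```

Courses such as SPA 23 and SPA 24 are therefore grouped with SPA 2.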


### Creating LSTM to do Document Classification

We follow the tutorial at https://towardsdatascience.com/multi-class-text-classification-with-lstm-using-tensorflow-2-0-d88627c10a35 to build a bidirectional LSTM.

In [14]:
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional, Embedding

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english') + stopwords.words('spanish'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First, we iterate over every essay, remove its stop words, and append the result to our list of documents.

In [15]:
docs_e = []
for doc in df_e['essay']:
  for word in STOPWORDS:
    token = ' ' + word + ' '
    doc = doc.replace(token, ' ')  # drop the stop word when it is surrounded by spaces
    doc = doc.replace('  ', ' ')   # collapse any double space left behind
  docs_e.append(doc)
print(len(docs_e))

164
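
The string replacement above only removes a stop word when it is lowercase and has a space on both sides, so words at the start of an essay, capitalized words, or words next to punctuation are kept. A token-based filter is one possible alternative; the sketch below is an assumption-laden illustration, with a tiny hypothetical stop-word set standing in for the NLTK lists and a hypothetical helper `remove_stopwords`.

```python
import re

# Tiny illustrative stop-word set; the notebook itself uses NLTK's English and Spanish lists.
SAMPLE_STOPWORDS = {"una", "que", "es", "de", "la", "y"}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop the stop words."""
    tokens = re.findall(r"\w+", text.lower())  # \w is Unicode-aware, so "años" stays whole
    return " ".join(t for t in tokens if t not in stopwords)


print(remove_stopwords("Una persona famosa que admiro es Lauren", SAMPLE_STOPWORDS))
```

Lowercasing and dropping punctuation here does not lose information for the model, since the Keras `Tokenizer` lowercases text and filters punctuation by default.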


We are subtracting 1 from the labels so that our classes begin at 0. We will then one-hot encode our classes, assuming they resemble Likert scale measures of proficiency.

In [16]:
labels_e = df_e['score'] - 1
print(len(labels_e))
print(labels_e[:5])

164
0    1
1    2
2    2
3    1
4    3
Name: score, dtype: int64


In [17]:
labels_encoded = to_categorical(labels_e)
print(len(labels_encoded))
print(labels_encoded[:5])

164
[[0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
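
For reference, one-hot encoding can be written out in a few lines of plain Python. This is only a sketch of what `to_categorical` produces for these labels; the helper `one_hot` is hypothetical. Note that a one-hot target treats the five classes as unordered, so the model is not told that class 3 is closer to class 4 than to class 0.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    return [[1.0 if i == y else 0.0 for i in range(num_classes)] for y in labels]


# The first five zero-based labels printed above.
for row in one_hot([1, 2, 2, 1, 3], num_classes=5):
    print(row)
```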


In [18]:
print(set(df_e['score'] - 1))

{0, 1, 2, 3, 4}


In [19]:
# hyperparameters
vocab_size = 500
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

We fit the tokenizer on the essays and print the first 20 word tokens with their indices.

In [20]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(docs_e)
word_index = tokenizer.word_index
dict(list(word_index.items())[0:20])

{'<OOV>': 1,
 'años': 9,
 'canciones': 15,
 'dos': 16,
 'el': 3,
 'ella': 2,
 'en': 12,
 'famosa': 7,
 'famoso': 18,
 'gusta': 5,
 'muchas': 11,
 'música': 13,
 'película': 17,
 'persona': 6,
 'personas': 8,
 'ser': 14,
 'su': 10,
 'también': 20,
 'vida': 19,
 'él': 4}

In [21]:
padded_docs_e = pad_sequences(tokenizer.texts_to_sequences(docs_e), maxlen=max_length, padding=padding_type, truncating=trunc_type)
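
To show what `padding='post'` and `truncating='post'` mean, here is a plain-Python sketch for a single sequence. The helper `pad_post` is hypothetical and only mirrors the behavior for these two settings.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Keep the first `maxlen` items, then pad with `pad_value` at the end."""
    trimmed = seq[:maxlen]  # truncating='post' drops items from the end
    return trimmed + [pad_value] * (maxlen - len(trimmed))  # padding='post' appends zeros


print(pad_post([6, 7, 2], maxlen=5))
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))
```

With `max_length = 200`, essays shorter than 200 tokens are filled with zeros at the end and longer essays lose everything after the 200th token.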

In [22]:
label_seq_e = np.array(labels_encoded)

In [23]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(Bidirectional(LSTM(embedding_dim)))
model.add(Dropout(0.2))
model.add(Dense(5, activation='softmax'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          32000     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 5)                 645       
                                                                 
Total params: 98,693
Trainable params: 98,693
Non-trainable params: 0
_________________________________________________________________
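
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formula for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and the hyperparameters defined earlier; the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, units, num_classes = 500, 64, 64, 5

# Embedding: one vector of size embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_one_direction = 4 * (units * (embedding_dim + units) + units)
bilstm_params = 2 * lstm_one_direction  # forward and backward LSTMs

# Dense layer on the concatenated (2 * units) output, plus one bias per class.
dense_params = 2 * units * num_classes + num_classes

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
```

The totals agree with the summary: 32,000 + 66,048 + 645 = 98,693 trainable parameters, which is large relative to 164 essays.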


In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(padded_docs_e, label_seq_e, epochs=20, validation_split=0.2)

Epoch 1/20


In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

These plots show the training and validation accuracy and loss of the LSTM on the supervised classification task.

## Task 2: Figuring out how to classify Learners by Motivation

### Pre-Processing Data

In [None]:
df_reddit

First, we specify which columns to keep, dropping information that is not relevant to our task, such as flair styling and gildings. We then drop posts that have a deleted author, since we won't be able to connect them to other posts. Finally, we sort the posts and index them by author ID and post ID so that each user's posts are grouped together. A reformatted timestamp column is added in the following step.

In [None]:
df_r = df_reddit[['author', 'author_fullname', 'author_flair_text', 'id', 'body', 'created_utc', 'is_submitter', 'link_id', 'no_follow', 'parent_id', 'permalink', 'score', 'subreddit', 'subreddit_id']]
df_r = df_r.dropna(axis=0, subset=['author_fullname', 'author'])

df_r = df_r.sort_values(['author_fullname'])
df_r = df_r.set_index(['author_fullname', 'id'])
df_r

We add a timestamp and a month column to our posts so that we know when each post was made.

In [None]:
from datetime import datetime, timezone

utcs = (df_r['created_utc'].astype(int))
ts = []
month = []
for u in utcs:
   t = datetime.fromtimestamp(u, tz=timezone.utc)
   ts.append(t)
   month.append(datetime.strftime(t, '%m'))

df_r['timestamp'] = ts
df_r['month'] = month
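
Passing `tz=timezone.utc` matters because `datetime.fromtimestamp` otherwise converts to the local time zone of the machine, which can move a post into a different month. A short sketch with two `created_utc` values from the data shown earlier (the helper name `utc_month` is an assumption):

```python
from datetime import datetime, timezone


def utc_month(created_utc):
    """Return the zero-padded UTC month ('01'..'12') of a Unix timestamp."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc).strftime('%m')


print(utc_month(1546300964))  # first post in the dataset
print(utc_month(1569887499))  # a post made minutes before midnight UTC on 30 September
```

The second post falls in September in UTC, but it would be counted as October in any time zone that is ahead of UTC.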

### Looking at Data

#### Summary Data

Now we are going to look at summary statistics for the data.

In [None]:
print("There are", df_r.author.nunique(), "users in the dataset.\n")

# Look the counts up by subreddit name instead of relying on the sort order of value_counts().
subreddit_post_count = df_r['subreddit'].value_counts()

print("There are", subreddit_post_count['Spanish'], "posts from the r/Spanish subreddit")
print("There are", subreddit_post_count['learnspanish'], "posts from the r/learnspanish subreddit")

In [None]:
# Most common author flairs
df_r['author_flair_text'].value_counts()[:10]

In [None]:
# Average posts by a user
df_r.groupby(['author']).size().mean()

In [None]:
# Average length of a text post in characters
df_r['body'].str.len().mean()

#### Sample User Data

Here we isolate a random user ('gatosol', 't2_12hxof') to get an idea of what information we can gather from their profile and posts.

In [None]:
user_sample = df_r.loc['t2_12hxof'].reset_index()
user_sample

We see from the author_flair_text that they are a native speaker of Spanish and are therefore not a learner. We will eventually have to throw out this user, which we can do with NLP by classifying each user as a learner or a native speaker.

Let's look at a user that we know is a learner, user LangGeek ('t2_zzree'). Their flair identifies them as a B2 learner which means that they are "vantage or upper intermediate" level.

In [None]:
user_sample = df_r.loc['t2_zzree'].reset_index()
user_sample

Let's look at their posts (for example, id='eldx5a0') to see what information they've shared about their learning.

In [None]:
for i in range(len(user_sample)):
  post = user_sample.iloc[i]
  print(post['id'], ": ", post['body'], "\n")

We see that they've been learning for 7 years ('eldx5a0'), that they "speak like an Argentinian" but have a hard time understanding other Argentinians ('eibf8z7'), and that they talk to themselves in their head to facilitate learning ('euotx3g'). These comments also highlight differences among the varieties of Spanish spoken in different countries.

There's not much about the personal motivation of user 't2_zzree' in these posts, though they imply that the user is learning in order to communicate with someone they are interested in.

#### Extracting out Learners

In [None]:
df_r.fillna("",inplace=True)

df_r['author_flair_text'] = df_r['author_flair_text'].str.lower()

We use regex pattern matching on the flair text to find a subset of users whose flair explicitly states their relationship to Spanish (for example learner, heritage, native, or a proficiency level).

In [None]:
patterns = [r"learner", r"heritage", r"native",r"[a-z]{1}[0-2]{1}", r"student", r"beginner", r"intermediate", r"advanced"]

flair_patterns = '|'.join(patterns)
df_r['learner'] =  df_r['author_flair_text'].str.contains(flair_patterns)
df_r
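
The level pattern `[a-z]{1}[0-2]{1}` matches any letter followed by 0, 1, or 2, so it also fires on flairs that do not contain a CEFR level. One possible refinement, shown here only as a sketch and not as the pattern used in this notebook, is to restrict the match to the six CEFR levels with word boundaries. The names `CEFR_LEVEL`, `FLAIR_RE`, and `is_flagged` are hypothetical.

```python
import re

# Same keywords as in the notebook, with a stricter CEFR pattern: a1, a2, b1, b2, c1, c2.
KEYWORDS = ["learner", "heritage", "native", "student", "beginner", "intermediate", "advanced"]
CEFR_LEVEL = r"\b[abc][12]\b"
FLAIR_RE = re.compile("|".join(KEYWORDS + [CEFR_LEVEL]))


def is_flagged(flair):
    """Return True if the lowercased flair contains a keyword or a CEFR level."""
    return bool(FLAIR_RE.search(flair.lower()))


for flair in ["Learner (A1)", "C1", "Perfecting it", "model x0"]:
    print(repr(flair), is_flagged(flair))
```

The looser pattern would flag "model x0" because of the substring "x0", while the stricter one does not.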

In [None]:
df_learner = df_r.loc[df_r['learner']]
df_learner = df_learner.drop(columns=['learner'])
df_learner.set_index('author')

In [None]:
# Passing the path lets pandas open and close the file itself.
df_learner.to_csv('reddit_learners.csv')

#### Manually annotating posts with motivation

1. Go through posts and classify them as hasMotivation (true, false) based on having the above regex words.
2. Then from posts that do have motivation, combine to be one document for a user.
3. Transform the posts to vectors using word2vec
4. Perform K-means clustering on the users' posts in vector space
5. Use Elbow criterion to find best clusters

We are now importing a CSV that was manually annotated to assign each post to a motivation profile. The meanings of the labels are as follows:



0.   No mention of motivation
1.   Culture
2.   Dating & Family
3.   School & Lessons
4.   Career
5.   Travel
6.   Heritage

In [None]:
df_reddit_tagged = pd.read_csv('reddit_tagged.csv')

In [None]:
print(df_learner.shape)
print(df_reddit_tagged.shape)

The annotated subset covers about 5% of the posts from learners. It could potentially be used for semi-supervised learning if our model fits the supervised task well.

In [None]:
df_tagged = df_reddit_tagged[['author','author_flair_text','body','created_utc','class']]
df_tagged = df_tagged.dropna()
df_tagged['class'] = df_tagged['class'].astype(int)
df_tagged

In [None]:
# Most common author flairs
df_tagged['author_flair_text'].value_counts()[:10]

In [None]:
user_sample = df_tagged.loc[df_tagged['author'] == 'oldskoolgeometro']
user_sample

Here we look at user 'oldskoolgeometro' who has identified themselves as a "beginner" and "learner (a1)".

Most of their posts are tagged 0 for no motivation, but the below post is labeled for class 5: "travel".

In [None]:
print(user_sample.loc[16]['body'])
print(user_sample.loc[16]['class'])

We will drop all posts that are labeled 0 for having no motivation. We are then going to subtract 1 from all labels to start our classes at 0.

In [None]:
# Keep only posts with a motivation label (class > 0); copy so later edits do not touch df_tagged.
df_motiv = df_tagged[df_tagged['class'] > 0].copy()

In [None]:
df_motiv['class'] = df_motiv['class'] - 1
df_motiv

In [None]:
import seaborn as sns

sns.countplot(x='class', data=df_motiv)

The plot shows how many posts fall into each class. After re-indexing, the classes are:

0.   Culture
1.   Dating & Family
2.   School & Lessons
3.   Career
4.   Travel
5.   Heritage

### Initial Analysis

First we are going to clean up the text data.

In [None]:
df = df_motiv

In [None]:
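# Escape the punctuation so that characters such as ] and \ are safe inside the character class.
punct_re = re.compile(r"[{}]".format(re.escape(string.punctuation)))

texts = []
for text in df['body']:
  text = punct_re.sub(" ", text.lower())       # lowercase and replace punctuation with spaces
  text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop single-letter tokens
  text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
  texts.append(text)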

df['body'] = texts
df

In [None]:
labels = np.asarray(df['class']).astype(int)
labels[:5]

#### Vectorizing with Tf-Idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

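# Note: a list passed to stop_words is taken literally, so this removes only the
# tokens 'english' and 'spanish'. To drop NLTK's English and Spanish stop words,
# pass stop_words=list(STOPWORDS); the feature count would then change.
tfidfvect = TfidfVectorizer(max_features=2000, min_df=5, max_df=0.8, stop_words=['english','spanish'], use_idf=True)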
tfidf = tfidfvect.fit_transform(df['body'])

print(tfidf.shape)

We have 222 samples that have 535 features.
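
To make the tf-idf weights less of a black box, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It skips tokenization details, `min_df`, `max_df`, and stop words, and the example documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized tf-idf vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors


docs = ["quiero aprender español", "aprender español viaje", "español trabajo"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(term, round(weight, 3))
```

In the first document, "quiero" gets the highest weight because it appears in only one document, while "español" appears in all three and is weighted lowest among the terms present.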

In [None]:
print(df.loc[0]['body'])
x = pd.DataFrame(tfidf[0].T.todense(), index = tfidfvect.get_feature_names_out(), columns = ['tfidf'])
x = x.sort_values(by = ['tfidf'], ascending=False)
print(x[:5])
print("\nLabel: ", labels[0])

#### Scatterplot Using Actual Labels

In [None]:
from sklearn.decomposition import PCA

In [None]:
# We fit the PCA on the dense version of the tf-idf matrix.
# toarray() returns an ndarray; todense() returns np.matrix, which recent scikit-learn rejects.
pca = PCA(n_components=len(labels))
two_dim = pca.fit_transform(tfidf.toarray())

scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component

In [None]:
plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red', 3: 'yellow', 4: "pink", 5:"black"}

# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(labels):
    ix = np.where(labels == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()

#### K-Means Clustering & PCA

In [None]:
from sklearn.cluster import KMeans

In [None]:
num=6

kmeans = KMeans(n_clusters = num, init='k-means++', max_iter = 15).fit(tfidf)
print(kmeans.cluster_centers_)

In [None]:
predicting = [df.iloc[0]['body'], df.iloc[1]['body'], df.iloc[2]['body'], df.iloc[3]['body']]
print("author ", df.iloc[0]['author'], ": ", df.iloc[0]['body'])
print("author ", df.iloc[1]['author'], ": ", df.iloc[1]['body'])
print("author ", df.iloc[2]['author'], ": ", df.iloc[2]['body'])
print("author ", df.iloc[3]['author'], ": ", df.iloc[3]['body'])
pred = kmeans.predict(tfidfvect.transform(predicting))
actual = [labels[0], labels[1], labels[2], labels[3]]

print("\nPredicted Labels by K-Means: ", pred)
print("\nActual Labels: ", actual)

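K-Means is unsupervised: it groups posts by similarity and assigns arbitrary cluster IDs, so a cluster ID is not expected to equal a class number. Even allowing for that, the clusters it found for these posts do not line up with the annotators' labels.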
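
Because cluster IDs are arbitrary, comparing them to class numbers element by element is not very informative. One common alternative is cluster purity: assign each cluster its most frequent annotated label and count how many posts agree. The sketch below is a minimal illustration; `cluster_purity` is a hypothetical helper and the toy lists are made up.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, true_labels):
        by_cluster[c].append(y)
    # For each cluster, count the samples that carry its most common label
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority_total / len(true_labels)

# Toy example: the cluster ids are a relabeling of the true labels, with one mismatch
toy_clusters = [2, 2, 2, 0, 0, 1, 1, 1]
toy_labels = [0, 0, 0, 1, 1, 2, 2, 0]
print(cluster_purity(toy_clusters, toy_labels))  # 0.875
```

In this notebook the call would look like `cluster_purity(kmeans.predict(tfidf), labels)`. Purity tends to rise as the number of clusters grows, so it is best compared at a fixed number of clusters.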

In [None]:
# First: for every document we get its corresponding cluster
clusters = kmeans.predict(tfidf)

In [None]:
preds = pd.DataFrame(clusters, columns=['cluster'])
sns.countplot(x='cluster', data=preds)

In [None]:
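# Fit PCA on the dense version of the TF-IDF matrix, keeping `num` components.
# toarray() returns an ndarray; recent scikit-learn versions reject the np.matrix from todense().
pca = PCA(n_components=num)
two_dim = pca.fit_transform(tfidf.toarray())

scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component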

In [None]:
print(pca.explained_variance_)

In [None]:
# plot the cumulative explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.show()

In [None]:
# Plot the explained variances
features = range(num)
plt.bar(features, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features)
plt.show()

In [None]:
#Visualize the first two components
PCA_components = pd.DataFrame(two_dim)
plt.scatter(PCA_components[0], PCA_components[1], alpha=.1, color='black')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()

In [None]:
#Visualize the 3rd and 4th components
#there appear to be two dense regions in this view
PCA_components = pd.DataFrame(two_dim)
plt.scatter(PCA_components[2], PCA_components[3], alpha=.1, color='black')
plt.xlabel('PCA 3')
plt.ylabel('PCA 4')
plt.show()

In [None]:
distortions = []
for k in range(1,10):
    model = KMeans(n_clusters=k)
    model.fit(tfidf)
    distortions.append(model.inertia_)

plt.figure(figsize=(16,8))
plt.plot(range(1,10), distortions, 'bx-')
plt.ylabel('Distortion')
plt.xlabel('k value')
plt.show()

In [None]:
plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red', 3: 'yellow', 4: "pink", 5:"black"}

# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()

In [None]:
#plot without group 0
plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red', 3: 'yellow', 4: "pink", 5:"black"}

# group by clusters and scatter plot every cluster
# with a colour and a label
for group in range(1,6):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()

In [None]:
# plot only clusters 3, 4, and 5, where a clearer split between groups is visible
plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red', 3: 'yellow', 4: "pink", 5:"black"}

# group by clusters and scatter plot every cluster
# with a colour and a label
for group in [3,4,5]:
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()

### Modeling

We are going to begin with a Bag-of-Words approach to predict motivation labels for each post. We will investigate Naive Bayes and Logistic Regression as models. Then we will train a sequential model.

In [None]:
print(tfidf.shape)
print(labels.shape)

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
# from sklearn.metrics import roc_curve, auc, roc_auc_score

#### Naive Bayes

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, labels, test_size=0.2, shuffle=True)

In [None]:
#Naive Bayes: Baseline Model with no smoothing

nb_tfidf = MultinomialNB(alpha = 0)
nb_tfidf.fit(X_train, y_train)
y_predict = nb_tfidf.predict(X_test)
y_prob = nb_tfidf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)

print(classification_report(y_test,y_predict))
mat = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:\n', mat)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label')

In [None]:
#hyperparameter tuning
#Naive Bayes: with add-1 smoothing

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train, y_train)
y_predict = nb_tfidf.predict(X_test)
y_prob = nb_tfidf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)
print(classification_report(y_test,y_predict))
mat = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:\n', mat)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label')
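
The two Naive Bayes runs above differ only in `alpha`. With additive smoothing, the estimate of P(word | class) is `(count + alpha) / (class_total + alpha * V)`, where `V` is the number of features. The sketch below uses made-up counts to show why `alpha = 0` is risky: any term that never occurs with a class in the training split gets probability zero for that class.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Additive (Laplace/Lidstone) estimate of P(word | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Made-up numbers: a term never seen with a class whose feature counts sum to 100,
# and a vocabulary of 535 terms
with_smoothing = smoothed_likelihood(0, 100, 535, alpha=1.0)
without_smoothing = smoothed_likelihood(0, 100, 535, alpha=0.0)
print(with_smoothing)     # small but non-zero
print(without_smoothing)  # 0.0
```

With TF-IDF input the "counts" are fractional weights rather than integers, but the same formula applies.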

In [None]:
#Naive Bayes: with add-1 smoothing
#try a 70-30 train-test split
X_train, X_test, y_train, y_test = train_test_split(tfidf, labels, test_size=0.3, shuffle=True)

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train, y_train)
y_predict = nb_tfidf.predict(X_test)
y_prob = nb_tfidf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)
print(classification_report(y_test,y_predict))
mat = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:\n', mat)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label')

#### Logistic Regression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, labels, test_size=0.2, shuffle=True)

In [None]:
lg_tfidf = LogisticRegression()
lg_tfidf.fit(X_train, y_train)
y_predict = lg_tfidf.predict(X_test)
y_prob = lg_tfidf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)
print(classification_report(y_test,y_predict))
mat = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:\n', mat)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label')

In [None]:
lg2_tfidf = LogisticRegression(solver = 'liblinear', random_state = 20, penalty = 'l2')

lg2_tfidf.fit(X_train, y_train)
y_predict2 = lg2_tfidf.predict(X_test)
y_prob2 = lg2_tfidf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)
print(classification_report(y_test,y_predict2))
mat = confusion_matrix(y_test, y_predict2)
print('Confusion Matrix:\n', mat)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label')

#### LSTM

In [None]:
# hyperparameters
vocab_size = 5000
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

In [None]:
labels = labels.astype(int)
docs = []
for doc in df['body']:
  for word in STOPWORDS:
    token = ' ' + word + ' '
    doc = doc.replace(token, ' ')
    doc = doc.replace('  ', ' ')  # collapse any double space into a single space
  docs.append(doc)
print(len(labels))
print(len(docs))

In [None]:
print(set(labels))

In [None]:
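# train_docs: the first 80% of docs (validation_split in model.fit holds out the last 20%)
train_docs = docs[:int(len(docs) * training_portion)]
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_docs)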
word_index = tokenizer.word_index
dict(list(word_index.items())[0:20])

In [None]:
sequences = tokenizer.texts_to_sequences(docs)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [None]:
label_seq = np.array(labels)

We create the model with an embedding layer, a bidirectional LSTM layer, a dropout layer to reduce overfitting, and a softmax output layer over the six classes.

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(Bidirectional(LSTM(embedding_dim)))
model.add(Dropout(0.2))
model.add(Dense(6, activation='softmax'))
model.summary()
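
As a sanity check on the architecture, the trainable parameter counts can be worked out by hand. The sketch below assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and one bias vector) and that the bidirectional wrapper concatenates the forward and backward outputs; the helper function names are hypothetical. Once the model is built, the totals reported by `model.summary()` should be comparable under those assumptions.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embedding_dim, n_classes = 5000, 64, 6
emb = embedding_params(vocab_size, embedding_dim)
bilstm = 2 * lstm_params(embedding_dim, embedding_dim)  # forward + backward LSTM
dense = dense_params(2 * embedding_dim, n_classes)      # input is the concatenated output
total = emb + bilstm + dense                            # dropout adds no parameters
print(emb, bilstm, dense, total)  # 320000 66048 774 386822
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind given that the corpus has only a few hundred posts.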

We did not one-hot encode the labels, so we use sparse categorical crossentropy as the loss, which accepts integer class IDs directly. Note that these labels are unordered categories with no numeric relationship between them.
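
Integer labels pair with the sparse variant of the loss, which gives the same value as ordinary categorical crossentropy on the equivalent one-hot vector. The short sketch below checks this for a single example; the probability vector is made up and the two functions are plain-Python stand-ins, not the Keras implementations.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Crossentropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Same loss, taking the integer class index instead of a one-hot vector."""
    return -math.log(probs[label])

probs = [0.1, 0.6, 0.1, 0.1, 0.05, 0.05]  # made-up softmax output over six classes
label = 1
one_hot = [1 if i == label else 0 for i in range(len(probs))]

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)
```

Both values equal `-ln(0.6)`, so the choice between the two losses depends only on how the labels are stored.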

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(padded, label_seq, epochs=10, validation_split=0.2)

In [None]:
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

### Time Sequential Analysis

In [None]:
from datetime import datetime, timezone

utcs = (df['created_utc'].astype(int))
ts = []
month = []
for u in utcs:
   t = datetime.fromtimestamp(u, tz=timezone.utc)
   ts.append(t)
   month.append(datetime.strftime(t, '%m'))

df['timestamp'] = ts
df['month'] = month
df

In [None]:
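# Convert month to int first, then keep the sorted frame (sort_values returns a new DataFrame)
df['month'] = df['month'].astype(int)
df = df.sort_values('month')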

In [None]:
sns.countplot(x='month', hue='class', data=df)