# DSFB Assignment 5

In this assignment, you will begin to work with text data and natural language processing. You will analyze aspects of the DonorsChoose.org program. Aspects of this project were first posed as a Kaggle challenge, and the data comes from the [Kaggle DonorsChoose.org Application Screening challenge](https://www.kaggle.com/c/donorschoose-application-screening/data). We have changed what you need to do in this assignment (so it does not track what was done in the Kaggle challenge); nevertheless, using or referring to the Kaggle challenge repository is not allowed for the assignment.

###  DonorsChoose.org  
  
Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. In this assignment, you will analyze the text of the essays and requirements from each proposal.

<img src="https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg" width="500" height="500" align="center"/>

Image source: https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg

### Data

As you will see, this dataset includes many different kinds of features, with both structured and unstructured data. The dataset consists of application materials (see *application_data.csv*) and resources requested (see *resource_data.csv*). The application materials contain the following features.

| Feature name  | Description  |
|----------------|--------------|
| id  | Unique id of the project application    |
| teacher_id    | id of the teacher submitting the application  |
| teacher_prefix    | title of the teacher's name (Ms., Mr., etc.)    |
| school_state    | US state of the teacher's school    |
| project_submitted_datetime    | application submission timestamp    |
| project_grade_category    | school grade levels (PreK-2, 3-5, 6-8, and 9-12)   |
| project_subject_categories   | category of the project (e.g., "Music & The Arts")    |
| project_subject_subcategories    | sub-category of the project (e.g., "Visual Arts")    |
| project_title    | title of the project    |
| project_essay_1    | first essay*   |
| project_essay_2    | second essay*    |
| project_essay_3    | third essay*   |
| project_essay_4    | fourth essay*  |
| project_resource_summary    | summary of the resources needed for the project    |
| teacher_number_of_previously_posted_projects   | number of previously posted applications by the submitting teacher    |
| project_is_approved    | whether DonorsChoose proposal was accepted (0="rejected", 1="accepted"); train.csv only    |


\*Note: Prior to May 17, 2016, the prompts for the essays were as follows:

  * project_essay_1: "Introduce us to your classroom"  

  * project_essay_2: "Tell us more about your students"  

  * project_essay_3: "Describe how your students will use the materials you're requesting"  

  * project_essay_4: "Close by sharing why your project will make a difference"  

Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:

  * project_essay_1: "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."  

  * project_essay_2: "About your project: How will these materials make a difference in your students' learning and improve their school lives?"  

For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be missing (i.e. NaN).
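
The cutoff described above can be checked directly from a submission timestamp. The sketch below is illustrative only: `PROMPT_CHANGE` and `expected_essay_count` are hypothetical names, not part of the dataset or of any library.

```python
from datetime import datetime

# Date on which the essay prompts changed, taken from the note above.
PROMPT_CHANGE = datetime(2016, 5, 17)

def expected_essay_count(submitted: datetime) -> int:
    """Return how many essay fields should be filled for a given submission time."""
    # Submissions on or after the cutoff use the two-essay format.
    return 2 if submitted >= PROMPT_CHANGE else 4

print(expected_essay_count(datetime(2016, 4, 27, 9, 58, 4)))     # old four-essay format
print(expected_essay_count(datetime(2016, 11, 18, 14, 45, 59)))  # new two-essay format
```

The same comparison works on a pandas datetime column, which is why parsing `project_submitted_datetime` as a date when loading the file is useful.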


### Special NLP Libraries

We will use several new libraries for this assignment, so be sure to install them on your machine first with `pip` in a terminal:

    pip install --user -U nltk
    pip install -U gensim
    pip install -U spacy
    pip install -U pyldavis

## IMPORTS

In [1]:
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

# PART 1: Prep

**PROBLEM**: To use a particular model in the `spacy` package, you need to manually download and install that particular model. You will need to run the following code from a terminal: `python -m spacy download en_core_web_sm`. Rather than doing that manually from bash in a separate terminal program, do it inline below using a "magic" command in jupyter. HINT: Use *!* followed by a bash command in a cell to run a bash command.

In [2]:
# Download en_core_web_sm for spacy

!python3 -m spacy download en_core_web_sm

You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.9/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


**PROBLEM**: To confirm that `spacy` is working (and `en_core_web_sm` is installed on your computer), you should be able to use `spacy.load()` to build a `Language` object to perform some basic nlp. Do that below:

In [3]:
# Test use of spacy by using the spacy.load() function
import spacy
import en_core_web_sm
nlp = spacy.load('en_core_web_sm')

**PROBLEM**: Use nltk.download() to download a list of raw stopwords. (see NLTK documentation)

In [4]:
# Download NLTK stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ekaterinakryukova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**PROBLEM**: Use the `stopwords` object from `nltk` to build a list of English stopwords. 

In [5]:
# Get English Stopwords from NLTK
from nltk.corpus import stopwords
stopWords = stopwords.words('english')

In [6]:
print(len(stopWords))

179


**PROBLEM**: Extend your `stop_words` list with some additional stopwords that you believe should be ignored in this particular context.

In [7]:
# Extend the stop word list  

stopWords.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stopWords))

184
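
Note that `list.extend` appends every word it is given, so the length grows even when a word is already in the list. If duplicates matter, a sketch like the following keeps only the new words. `extend_unique` is a hypothetical helper, and the word lists are toy data rather than the NLTK list.

```python
def extend_unique(base, extra):
    """Return a new list: base followed by the words in extra not already present."""
    seen = set(base)
    result = list(base)
    for word in extra:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

toy_stopwords = ["the", "from", "a"]
extended = extend_unique(toy_stopwords, ["from", "subject", "edu", "subject"])
print(extended)
```

Duplicates in a stopword list do not change which words get filtered; they only make the reported length less informative.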


### Download the Data

Unlike other projects, this project includes a training set too big for GitHub. Through the terminal tab of JupyterLab, download the data using the *wget* command, unzip it using the *unzip* command, and check that it's in the root directory of the project.

Locations : 

    Applications dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip
    Resources dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip
    
Hint: Use *wget* and *unzip* commands. Use *!* followed by a bash command in a cell to run a bash command.

**PROBLEM**: wget the data

In [8]:
# wget the data
import wget
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip','data')

'data/application_data.csv (1).zip'

In [9]:
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip','data')

'data/resource_data.csv (1).zip'

**PROBLEM**: unzip the data

In [10]:
# unzip the data
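from zipfile import ZipFile

# Use a context manager so the archive is closed, and avoid shadowing the built-in zip()
with ZipFile('data/application_data.csv.zip') as archive:
    archive.extractall('data/application_data')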

In [11]:
with ZipFile('data/resource_data.csv.zip') as archive:
    archive.extractall('data/resource_data')
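
To confirm that the extracted files ended up where the later `read_csv` calls expect them, the archive contents can be listed during extraction. This is a minimal sketch: `extract_and_list` is a hypothetical helper, and the demo archive is created on the fly as a stand-in for the real downloads.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

def extract_and_list(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted member names."""
    with ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())

# Self-contained demo on a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo_zip = tmp / "demo.csv.zip"
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.csv", "id,price\np000001,1.5\n")
    members = extract_and_list(demo_zip, tmp / "demo")
    extracted_ok = (tmp / "demo" / "demo.csv").exists()

print(members, extracted_ok)
```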


# PART 2: Load Data

**PROBLEM**: Load `application_data.csv` and investigate it a bit.

In [12]:
# Load applications
application_data = pd.read_csv('data/application_data/application_data.csv',parse_dates=['project_submitted_datetime'])
application_data.head(5)


Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,My students need 6 Ipod Nano's to create and d...,26,1
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,,,My students need matching shirts to wear for d...,1,0
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,,,My students need the 3doodler. We are an SEM s...,5,1
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activit...",My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",,,My students need balls and other activity equi...,16,0
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,,,My students need a water filtration system for...,42,1


In [13]:
#type of date column
application_data.project_submitted_datetime.dtypes

dtype('<M8[ns]')

In [14]:
#number of nan values
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 175706, 175706)

**PROBLEM**: Load `resource_data.csv` and investigate it a bit.

In [15]:
# Load resources

resource_data = pd.read_csv('data/resource_data/resource_data.csv')
resource_data.head(5)

Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Bo...",2,13.59
4,p069063,EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS...,3,24.95


**PROBLEM**: Some of the essays are NA. Replace NAs with empty strings.

In [16]:
# Replace NA values in essay columns with ''

essay_cols = ['project_essay_3', 'project_essay_4']
application_data[essay_cols] = application_data[essay_cols].fillna('')

In [17]:
#count nan
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 0, 0)

In [18]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_essay_1', 'project_essay_2',
       'project_essay_3', 'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved'],
      dtype='object')

**PROBLEM**: To simplify matters, combine all essays into just one feature called "essays"

In [19]:
# Combine essays
application_data['essays']=application_data['project_essay_{}'.format(1)]
for i in range(2,5):
    application_data['essays']+=application_data['project_essay_{}'.format(i)].astype(str)

In [20]:
#get data with all essays
application_data[application_data.project_submitted_datetime<'2016-05-17'].head(1)

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,essays
18,p232007,e7a8f866e3174a77ffe37323f032a8ac,Mrs.,FL,2016-04-27 09:58:04,Grades PreK-2,"Applied Learning, Literacy & Language","College & Career Prep, Literature & Writing",Watch Readers Grow!,During our reading workshop students are at da...,My students lack confidence. I have a class wi...,During our reading workshop. We do mini lesson...,Your donations would greatly be a blessing for...,My students need these reading materials to he...,6,1,During our reading workshop students are at da...


In [21]:
#check
application_data['essays'][18]

'During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged this

In [22]:
for i in range(1,5):
    print(application_data['project_essay_{}'.format(i)][18])

During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.
My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.
During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged thi
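
In the combined text printed above, each essay runs directly into the next (for example `...the skills they.My students lack...`), because the columns are concatenated without a separator. A minimal sketch of joining with a single space follows; the toy frame and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "project_essay_1": ["First essay.", "One."],
    "project_essay_2": ["Second essay.", "Two."],
    "project_essay_3": [None, "Three."],
    "project_essay_4": [None, "Four."],
})

essay_cols = [f"project_essay_{i}" for i in range(1, 5)]

# Fill missing essays with '', join each row with a space, then trim leftover spaces.
toy["essays"] = (
    toy[essay_cols]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.strip()
)

print(toy["essays"].tolist())
```

Without the separator, the last word of one essay and the first word of the next can fuse into a single token once punctuation is stripped.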

In [23]:

#drop separate columns of essays
for i in range(1,5):
    application_data.drop(columns=['project_essay_{}'.format(i)],inplace=True)

In [24]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays'],
      dtype='object')

In [26]:
resource_data.columns

Index(['id', 'description', 'quantity', 'price'], dtype='object')

**PROBLEM**: Merge the resources and application datasets on the *id* feature.

In [27]:
# Merge two datasets


merged_df=application_data.merge(resource_data, on='id', how='inner')
# Check the data to confirm it worked

merged_df.columns


Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays', 'description', 'quantity', 'price'],
      dtype='object')
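
Because one application can request several resources, this merge is one-to-many: each application row is repeated once per matching resource row. The sketch below shows the effect on toy data (all values are made up) and uses the `validate` argument of `DataFrame.merge` to assert that `id` is unique on the application side.

```python
import pandas as pd

applications = pd.DataFrame({"id": ["p1", "p2"], "school_state": ["NV", "GA"]})
resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["item a", "item b", "item c"],
})

# validate raises pandas.errors.MergeError if ids are not unique on the left side.
merged = applications.merge(resources, on="id", how="inner", validate="one_to_many")
rows_per_application = merged.groupby("id").size()

print(merged)
print(rows_per_application)
```

One consequence for the later topic models is that an essay is counted once for every resource its application lists.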

**PROBLEM**: Keep the following data for additional analysis (the id and the text features): `id`, `school_state`, `project_subject_categories`, `project_subject_subcategories`, `essays`, `description`

In [28]:
FEATURE_NAMES = ['school_state', 'project_subject_categories', 'project_subject_subcategories', 'essays', 'description']

In [29]:
# Keep the text features

merged_df_textual=merged_df[['id']+FEATURE_NAMES]

In [30]:
merged_df_textual.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy & Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano® 16GB MP3 Player (8th Genera...
1,p036502,NV,Literacy & Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano® 16GB MP3 Player (8th Genera...
2,p039565,GA,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Our elementary school is a culturally rich sch...,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
3,p233823,UT,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Hello;\r\nMy name is Mrs. Brotherton. I teach ...,3doodler Start Full Edu Bundle
4,p185307,NC,Health & Sports,Health & Wellness,My students are the greatest students but are ...,BALL PG 4'' POLY SET OF 6 COLORS


# PART 3: Preprocess Text

Make an independent copy of the data so we can restart here when testing...

In [31]:
data = merged_df_textual.copy()  # independent copy of the merged pandas DataFrame

**PROBLEM**: Define a custom function `clean_punctuation()` to remove some punctuation from your text data. You don't have to do absolutely everything one might want to do - just show that you can do it. Start with some easy operations using `str.replace()`.

In [32]:
# Define a custom function to clean punctuation from  given text

def clean_punctuation(txt):
    txt=txt.replace('&', ' ')
    txt=txt.replace('.', ' ')
    txt=txt.replace("\\r\\n", " ")  # the raw text contains the literal characters \r\n, not real line breaks
    return txt

**PROBLEM**: Use the `apply()` function from pandas to _apply_ that function down the `essays` column of your data.

In [33]:
# Apply your function to clean the essays column
for feature in ['school_state', 'project_subject_categories', 'project_subject_subcategories', 'essays']:
    data[feature]=data[feature].apply(clean_punctuation)
    
    
    
data.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano® 16GB MP3 Player (8th Genera...
1,p036502,NV,Literacy Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano® 16GB MP3 Player (8th Genera...
2,p039565,GA,"Music The Arts, Health Sports","Performing Arts, Team Sports",Our elementary school is a culturally rich sch...,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
3,p233823,UT,"Math Science, Literacy Language","Applied Sciences, Literature Writing",Hello; My name is Mrs Brotherton I teach 5th...,3doodler Start Full Edu Bundle
4,p185307,NC,Health Sports,Health Wellness,My students are the greatest students but are ...,BALL PG 4'' POLY SET OF 6 COLORS


**PROBLEM**: Define **another** custom function called `clean_re()` to clean your text data using regular expressions. Do at least two "cleanings" (i.e., show that you can use the `re` library).

In [34]:
# Define a custom function to clean some given text
import re

def clean_re(txt):
    p = re.compile(r'[^\w\s]')
    txt=p.sub('', txt)
    
    return txt

In [35]:
# Apply clean_re() to all features

for feature in FEATURE_NAMES:
    data[feature]=data[feature].astype(str).apply(clean_re)
    
    
    
data.head()


Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy Language,Literacy,Most of my kindergarten students come from low...,Apple iPod nano 16GB MP3 Player 8th Generatio...
1,p036502,NV,Literacy Language,Literacy,Most of my kindergarten students come from low...,Apple iPod nano 16GB MP3 Player 8th Generatio...
2,p039565,GA,Music The Arts Health Sports,Performing Arts Team Sports,Our elementary school is a culturally rich sch...,Reebok Girls Fashion Dance Graphic TShirt Dd ...
3,p233823,UT,Math Science Literacy Language,Applied Sciences Literature Writing,Hello My name is Mrs Brotherton I teach 5th ...,3doodler Start Full Edu Bundle
4,p185307,NC,Health Sports,Health Wellness,My students are the greatest students but are ...,BALL PG 4 POLY SET OF 6 COLORS


In [36]:
data['description'].head(10)

0    Apple  iPod nano 16GB MP3 Player 8th Generatio...
1    Apple  iPod nano 16GB MP3 Player 8th Generatio...
2    Reebok Girls Fashion Dance Graphic TShirt  Dd ...
3                       3doodler Start Full Edu Bundle
4                       BALL PG 4 POLY SET OF 6 COLORS
5                     BALL PLAYGROUND POLY 85 SET OF 6
6                            KIT JUMBO GRADESTUFF PACK
7                           PARACHUTE GRIPSTARCHUTE 24
8                           RECESS PACK GRADE K VIOLET
9    Crown Berkey Water Filter With 2 Black and 2 P...
Name: description, dtype: object
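
The output above still contains runs of spaces where punctuation was removed (for example `Apple  iPod`). A second regular-expression step can collapse them. This is a sketch only: `clean_re_two_steps` is a hypothetical variant of `clean_re()`, and the sample string is made up.

```python
import re

_PUNCT = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace
_SPACES = re.compile(r"\s+")     # one or more whitespace characters

def clean_re_two_steps(txt):
    """Strip punctuation, then collapse repeated whitespace into single spaces."""
    txt = _PUNCT.sub("", txt)
    txt = _SPACES.sub(" ", txt).strip()
    return txt

print(clean_re_two_steps("Apple - iPod nano 16GB  (8th Generation)"))
```

Compiling the patterns once at module level avoids rebuilding them on every call when the function is applied to a large column.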

**PROBLEM**: Remove stopwords. (Hint: use stopwords from nltk's `stopwords()` plus any additions you'd like to make. Then, again, define a custom function and then apply it to all features.)

In [37]:
stopWords.extend(['although','engaging','approximately','yet','nan','u','us','would','would','see','big','student'])
print(len(stopWords))

196


In [38]:
# Define custom function to remove stopwords
df = copy.copy(data)
stop_set = set(stopWords)  # set lookups are fast; scanning the list for every word is slow

def clean_stopword(txt):
    words = txt.lower().split()
    filtered_words = [w for w in words if w not in stop_set]
    return ' '.join(filtered_words)

In [39]:
# Apply function to remove stopwords  
for feature in FEATURE_NAMES:
    df[feature]=df[feature].apply(clean_stopword)
    
    
    
df.head(10)



KeyboardInterrupt: 

**PROBLEM**: Now use Gensim’s `simple_preprocess()` function to tokenize and clean up your text data. TIP: `simple_preprocess()` returns a list of words, so we want to wrap it with a function that joins the list back together into a string.

In [None]:
# Define custom function to wrap simple_preprocess() from gensim
from gensim.utils import simple_preprocess
df2 = copy.copy(df) 
def simple_preprocess_custom(txt):
    txt=simple_preprocess(txt)
    txt=' '.join(txt)
    return txt

In [None]:
# Apply simple_preprocess() to all features

for feature in FEATURE_NAMES:
    df2[feature]=df2[feature].apply(simple_preprocess_custom)

In [None]:
df2.head()

**PROBLEM**: Lemmatize the text. (Hint: Define a custom function and then apply it to all features.)

In [None]:
nltk.download('wordnet')

In [None]:
# Write a lemmatization function based on nltk.stem.WordNetLemmatizer()
from nltk.corpus import wordnet
from collections import Counter
df3=copy.copy(df2)
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def get_pos( word ):
    w_synsets = wordnet.synsets(word)

    pos_counts = Counter()
    pos_counts["n"] = len(  [ item for item in w_synsets if item.pos()=="n"]  )
    pos_counts["v"] = len(  [ item for item in w_synsets if item.pos()=="v"]  )
    pos_counts["a"] = len(  [ item for item in w_synsets if item.pos()=="a"]  )
    pos_counts["r"] = len(  [ item for item in w_synsets if item.pos()=="r"]  )
    
    most_common_pos_list = pos_counts.most_common(3)
    return most_common_pos_list[0][0]

def lemmatize_text(text):
    text=[lemmatizer.lemmatize(w,get_pos(w)) for w in w_tokenizer.tokenize(text)]
    return text

In [None]:
# Apply lemmatize_text() to all features  
from tqdm import tqdm
for feature in tqdm(['essays','description']):
    df3[feature]=df3[feature].apply(lemmatize_text)


In [None]:
df3.head()

**PROBLEM**: What happened to the data in the pandas dataframe?

ANSWER: Each cell in the lemmatized columns (`essays` and `description`) was converted from a single string of text into a list of individual lemmatized tokens.

# PART 4:  Make an LDA topic model for the ESSAYS.

**Note: Part 4 is worth 10 points (the value of 10 individual problems).**

Define an LDA topic model for the `essays`. Compute the "Coherence score." Visually inspect the topic model by inspecting the top keywords from each topic. Gensim provides functions for all of these tasks.

In [None]:

# Create Dictionary
id2word = corpora.Dictionary(df3['essays'])

# Create Corpus
texts = df3['essays']
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View unique id for each word in the essay
print(corpus[:1])


In [None]:
# View word a given id corresponds to for id=0
id2word[0]

In [None]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
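
The cells above rely on the bag-of-words representation: each document becomes a list of `(token_id, count)` pairs. The sketch below illustrates the idea in plain Python. It is not gensim's implementation, the helper names are hypothetical, and the ids it assigns need not match those of `corpora.Dictionary`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of doc found in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["read", "book", "read"], ["book", "math"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(bows)
```

Tokens missing from the vocabulary are silently dropped, which is why a corpus should be encoded with the dictionary built from the same texts.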

In [None]:
#df.project_subject_categories.value_counts()

In [None]:

# Build LDA model --too long
"""lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=13, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)"""



In [None]:
import time
start_time = time.time()
##
## Build Multicore LDA
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics = 7, id2word=id2word,random_state=100,passes=10)
# Saving trained model
lda_multicore_model.save('LDA_NYT_multicore')
# Loading trained model
lda_multicore_model = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
df3['essays'].head()

In [None]:
# Print the Keyword in the 7 topics
print(lda_multicore_model.print_topics())
doc_lda = lda_multicore_model[corpus]

In [None]:
# Compute the per-word likelihood bound (log perplexity)
print('\nLog perplexity: ', lda_multicore_model.log_perplexity(corpus))  # perplexity = 2**(-bound); a higher (less negative) bound means lower perplexity

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model, texts=df3['essays'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
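
Gensim's `log_perplexity()` returns a per-word likelihood bound (a negative number), and the gensim documentation gives the corresponding perplexity as 2 raised to the negative bound. The sketch below performs that conversion; the bounds are hypothetical values, not results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound into perplexity via 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds for two models.
for bound in (-8.0, -7.0):
    print(bound, perplexity_from_bound(bound))
```

Lower perplexity is better, which corresponds to a bound that is higher (closer to zero).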

If you use gensim and the following three variables, then you can visualize topics & keywords with the code below.

    lda_model:    this is an LDA model generated by gensim.models.ldamodel.LdaModel()
    id2word:      this is the dictionary term IDs from corpora.Dictionary()
    corpus:       this is the collection of "documents"


In [None]:
# Visualize topics-keywords
lda_model=lda_multicore_model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

# PART 5:  Make an LDA topic model for the DESCRIPTIONS.

**Note: Part 5 is worth 5 points (the value of 5 individual problems).**

Using the same K (and any other hyperparameters from Part 4), recompute a model for Descriptions. Compare the two sets of results. Do they vary? How? Why? Explain what you find. 

In [None]:

# Create Dictionary
id2word_description = corpora.Dictionary(df3['description'])

# Create Corpus
texts_description = df3['description']
# Term Document Frequency
corpus_description = [id2word_description.doc2bow(text) for text in texts_description]

# View unique id for each word in the first description
print(corpus_description[:1])


In [None]:
# 
import time
start_time = time.time()
##
## Build Multicore LDA
lda_multicore_model_d = gensim.models.ldamulticore.LdaMulticore(corpus_description, num_topics = 7,
                                                                id2word=id2word_description,random_state=100,passes=10)
# Saving trained model
lda_multicore_model_d.save('LDA_NYT_multicore_d')
# Loading trained model
lda_multicore_model_d = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore_d')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))




In [None]:
# Print the Keyword in the 7 topics
print(lda_multicore_model_d.print_topics())
doc_lda_d = lda_multicore_model_d[corpus_description]

In [None]:
# Compute the per-word likelihood bound (log perplexity)
print('\nLog perplexity: ', lda_multicore_model_d.log_perplexity(corpus_description))  # perplexity = 2**(-bound); a higher (less negative) bound means lower perplexity

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model_d, texts=df3['description'], dictionary=id2word_description, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Visualize topics-keywords
lda_model=lda_multicore_model_d
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus_description, id2word_description)
vis
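
One rough way to compare the essay and description models is to measure how much their top keywords overlap. The sketch below uses Jaccard similarity between keyword sets. The helper names are hypothetical, and the keyword lists are made up; in practice they would be taken from each model's topics.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, return (index of closest topic in topics_b, score)."""
    matches = []
    for words_a in topics_a:
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Made-up keyword lists standing in for the top words of each model's topics.
essay_topics = [["read", "book", "library"], ["math", "science", "lab"]]
description_topics = [["book", "set", "library"], ["ipad", "case", "tablet"]]

print(best_matches(essay_topics, description_topics))
```

Low overlap would be unsurprising here, since essays describe students and goals while descriptions are product names.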

# PART 6:  Use TextHero and help to improve it.

**Note: This is worth 5 points (the value of 5 individual problems).**

[TextHero](https://texthero.org/) is an open-source project developed by a student from the [TIS Lab of Prof. Younge](https://www.epfl.ch/labs/tis). Go to the [Git repository for TextHero](https://github.com/jbesomi/texthero), install the package, review the documentation, and, if you are impressed by the package, give it a star and tell others! (Not required)

Once you understand TextHero, use the package to re-implement major portions of Part 3 of this assignment that you completed above.

In [None]:

!pip install texthero



In [None]:
import texthero as hero


In [None]:
data2=copy.copy(merged_df_textual) 

In [None]:
for column in data2.columns:
    data2[column] = data2[column] .pipe(hero.clean)

In [None]:
data2['tfidf'] = (
    hero.tfidf(data2['essays'])
)

In [None]:
data2['pca'] = hero.pca(data2['tfidf'])
hero.scatterplot(
    data2,  # the frame that actually holds the 'pca' column
    col='pca',
    color='project_subject_categories',
    title="PCA of essay TF-IDF vectors"
)

**Note: This is worth 5 points (the value of 5 individual problems).**

Open-source packages rely on the community of users to help them grow and improve. Review the [contributing file](https://github.com/jbesomi/texthero/blob/master/CONTRIBUTING.md) for TextHero and then identify a portion of the documentation that you feel could be improved. Edit or write one paragraph of documentation for the package that you believe would improve it. Copy that paragraph below (into this notebook) so that it can be graded. And, if you think your contribution would truly help the project, please learn how to use git to suggest the change (a pull request) to the manager of the repository (Jonathan Besomi).

Documentation 

texthero.representation.nmf

nmf(s, n_components=2)

    Perform non-negative matrix factorization.

Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. 
This factorization can be used for example for dimensionality reduction, source separation or topic extraction.

Parameters

    s: Pandas Series
    n_components: Int. Default is 2.
        Number of components to keep. If n_components is None, all components are kept.

Examples

import texthero as hero
from texthero import preprocessing

import pandas as pd

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]
s = hero.clean(s, custom_pipeline)
s_tf_idf = hero.tfidf(s)
s_nmf = hero.nmf(s_tf_idf)
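
To make the factorization itself concrete, here is a small NumPy sketch of the classic multiplicative-update rule for NMF. It is an illustration under stated assumptions, not the code TextHero runs: `nmf_multiplicative` is a hypothetical function, and the matrix `X` is toy data.

```python
import numpy as np

def nmf_multiplicative(X, n_components=2, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative X as W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    # Random non-negative starting points.
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative matrix with two underlying row patterns.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 0.0, 4.0],
              [0.0, 6.0, 2.0]])

W, H = nmf_multiplicative(X)
error = np.linalg.norm(X - W @ H)
print(W.shape, H.shape, error)
```

In the text setting, the rows of `W` give each document's weight on each component and the rows of `H` give each component's weight on each term.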

Documentation for pca

Example:
    
import texthero as hero
from texthero import preprocessing

import pandas as pd

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]

s = hero.clean(s, custom_pipeline)

s_tf_idf = hero.tfidf(s)

s_pca = hero.pca(s_tf_idf)
