# DSFB Assignment 5

In this assignment, you will begin to work with text data and natural language processing. You will analyze aspects of th DonorsChoose.org program. Aspects of this project were first posed as a Kaggle challenge and the data comes from [Kaggle DonorsChoose.org Application Screening challenge](https://www.kaggle.com/c/donorschoose-application-screening/data). We have changed the nature of what you need to do in this assignment (so it does not track what was done in the Kaggle Challenge), but nevertheless using or referring to the Kaggle Challenge repository is not allowed for the assignment.

###  DonorsChoose.org  
  
Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. In this assignment, you will analyze the text of the essays and requirements from each proposal.

<img src="https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg" width="500" height="500" align="center"/>

Image source: https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg

### Data

As you will see, this dataset includes many different kinds of features with structured and unstructured data. The dataset consists of application materials (see *application_data.csv*) and resources requested (see *resource_data.csv*). The application materials (see *application_data.csv*) contain the following features.

| Feature name  | Description  |
|----------------|--------------|
| id  | Unique id of the project application    |
| teacher_id    | id of the teacher submitting the application  |
| teacher_prefix    | title of the teacher's name (Ms., Mr., etc.)    |
| school_state    | US state of the teacher's school    |
| project_submitted_datetime    | application submission timestamp    |
| project_grade_category    | school grade levels (PreK-2, 3-5, 6-8, and 9-12)   |
| project_subject_categories   | category of the project (e.g., "Music & The Arts")    |
| project_subject_subcategories    | sub-category of the project (e.g., "Visual Arts")    |
| project_title    | title of the project    |
| project_essay_1    | first essay*   |
| project_essay_2    | second essay*    |
| project_essay_3    | third essay*   |
| project_essay_4    | fourth essay*  |
| project_resource_summary    | summary of the resources needed for the project    |
| teacher_number_of_previously_posted_projects   | number of previously posted applications by the submitting teacher    |
| project_is_approved    | whether DonorsChoose proposal was accepted (0="rejected", 1="accepted"); train.csv only    |


\*Note: Prior to May 17, 2016, the prompts for the essays were as follows:

  * project_essay_1: "Introduce us to your classroom"  

  * project_essay_2: "Tell us more about your students"  

  * project_essay_3: "Describe how your students will use the materials you're requesting"  

  * project_essay_4: "Close by sharing why your project will make a difference"  

Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:

  * project_essay_1: "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."  

  * project_essay_2: "About your project: How will these materials make a difference in your students' learning and improve their school lives?"  

For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be missing (i.e. NaN).


### Special NLP Libraries

We will use several new libraries for this assignment - so be sure to first install those on your machine by with `pip` in a terminal:

    pip install --user -U nltk
    pip install -U gensim
    pip install -U spacy
    pip install -U pyldavis

## IMPORTS

In [130]:
import importlib
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

# PART 1: Prep

**PROBLEM**: To use a particular model in the `spacy` package, you need to manually download and install that particular model. You will need to run the following code from a terminal: `python -m spacy download en_core_web_sm`. Rather than doing that manually from bash in a separate terminal program, do it inline below using a "magic" command in jupyter. HINT: Use *!* followed by a bash command in a cell to run a bash command.

In [131]:
# Download en_core_web_sm for spacy

!python3 -m spacy download en_core_web_sm

You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.9/bin/python3 -m pip install --upgrade pip' command.[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


**PROBLEM**: To confirm that `spacy` is working (and `en_core_web_sm` is installed on your computer), you should be able to use `spacy.load()` to build a `Language` object to perform some basic nlp. Do that below:

In [132]:
# Test use of spacy by using the spacy.load() function
import spacy
import en_core_web_sm
nlp = spacy.load('en_core_web_sm')

**PROBLEM**: Use nltk.download() to download a list of raw stopwords. (see NLTK documentation)

In [133]:
# Download NLTK stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ekaterinakryukova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**PROBLEM**: Use the `stopwords` object from `nltk` to build a list of English stopwords. 

In [134]:
# Get English Stopwords from NLTK
from nltk.corpus import stopwords
stopWords = stopwords.words('english')

In [135]:
print(len(stopWords))

179


**PROBLEM**: Extend your `stop_words` list with some additional stopwords that you believe should be ignored in this particular context.

In [136]:
# Extend the stop word list  

stopWords.extend(['from', 'subject', 're', 'edu', 'use'])

print(len(stopWords))

184


### Download the Data

Unlike other projects, this project includes a training set too big for GitHub. Through the terminal lab of Jupyter lab, download the data using the *wget* command, unzip it using the *zip* command and check that it's in the root directory of the project. 

Locations : 

    Applications dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip
    Resources dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip
    
Hint: Use *wget* and *unzip* commands. Use *!* followed by a bash command in a cell to run a bash command.

**PROBLEM**: wget the data

In [137]:
# wget the data
import wget
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip','data')

'data/application_data.csv (1).zip'

In [138]:
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip','data')

'data/resource_data.csv (1).zip'

**PROBLEM**: unzip the data

In [139]:
# unzip the data
from zipfile import ZipFile
zip = ZipFile('data/application_data.csv.zip')
zip.extractall('data/application_data')

In [140]:
zip = ZipFile('data/resource_data.csv.zip')
zip.extractall('data/resource_data')


# PART 2: Load Data

**PROBLEM**: Load `application_data.csv` and investigate it a bit.

In [141]:
# Load applications
application_data = pd.read_csv('data/application_data/application_data.csv',parse_dates=['project_submitted_datetime'])
application_data.head(5)


Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,My students need 6 Ipod Nano's to create and d...,26,1
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,,,My students need matching shirts to wear for d...,1,0
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,,,My students need the 3doodler. We are an SEM s...,5,1
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activit...",My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",,,My students need balls and other activity equi...,16,0
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,,,My students need a water filtration system for...,42,1


In [142]:
application_data.shape

(182080, 16)

In [143]:
#type of date column
application_data.project_submitted_datetime.dtypes

dtype('<M8[ns]')

In [144]:
#number of nan values
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 175706, 175706)

**PROBLEM**: Load `resource_data.csv` and investigate it a bit.

In [145]:
# Load resources

resource_data = pd.read_csv('data/resource_data/resource_data.csv')
resource_data.head(5)

Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Bo...",2,13.59
4,p069063,EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS...,3,24.95


**PROBLEM**: Some of the essays are NA. Replace NAs with empty strings.

In [146]:
# Replace NA values in essay columns with ''

application_data[['project_essay_3','project_essay_4']]=application_data[['project_essay_3',
                  'project_essay_4']].replace(np.nan, '', regex=True)

In [147]:
#count nan
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 0, 0)

In [148]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_essay_1', 'project_essay_2',
       'project_essay_3', 'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved'],
      dtype='object')

**PROBLEM**: To simplify matters, combine all essays into just one feature called "essays"

In [149]:
# Combine essays
application_data['essays']=application_data['project_essay_{}'.format(1)]
for i in range(2,5):
    application_data['essays']+=application_data['project_essay_{}'.format(i)].astype(str)

In [150]:
#get data with all essays
application_data[application_data.project_submitted_datetime<'2016-05-17'].head(1)

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,essays
18,p232007,e7a8f866e3174a77ffe37323f032a8ac,Mrs.,FL,2016-04-27 09:58:04,Grades PreK-2,"Applied Learning, Literacy & Language","College & Career Prep, Literature & Writing",Watch Readers Grow!,During our reading workshop students are at da...,My students lack confidence. I have a class wi...,During our reading workshop. We do mini lesson...,Your donations would greatly be a blessing for...,My students need these reading materials to he...,6,1,During our reading workshop students are at da...


In [151]:
#check
application_data['essays'][18]

'During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged this

In [152]:
for i in range(1,5):
    print(application_data['project_essay_{}'.format(i)][18])

During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.
My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.
During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged thi

In [153]:

#drop separate columns of essays
for i in range(1,5):
    application_data.drop(columns=['project_essay_{}'.format(i)],inplace=True)

In [154]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays'],
      dtype='object')

In [155]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays'],
      dtype='object')

In [156]:
application_data.shape

(182080, 13)

In [157]:
resource_data.head()

Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Bo...",2,13.59
4,p069063,EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS...,3,24.95


**PROBLEM**: Merge the resources and application datasets on the *id* feature.

In [158]:
resource_data=resource_data.fillna(' ')
resource_data=resource_data.drop_duplicates()
resource_data['description']=resource_data.groupby(['id'])['description'].transform(lambda x : ' '.join(x)) 


In [159]:
# Merge two datasets


merged_df=application_data.merge(resource_data, on='id', how='left')
# Check the data to confirm it worked

merged_df.columns


Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays', 'description', 'quantity', 'price'],
      dtype='object')

In [160]:
merged_df.shape

(1073254, 16)

In [161]:
resource_data.shape,resource_data.id.nunique()

((1528928, 4), 260115)

In [162]:
application_data.shape

(182080, 13)

In [163]:
application_data.id.nunique()

182080

**PROBLEM**: Keep the following data for additional analysis (the id and the text features): `id`, `school_state`, `project_subject_categories`, `project_subject_subcategories`, `essays`, `description`

In [164]:
FEATURE_NAMES = ['school_state', 'project_subject_categories', 'project_subject_subcategories', 'essays', 'description']

In [165]:
# Keep the Text Featuresss

merged_df_textual=merged_df[['id']+FEATURE_NAMES]

In [166]:
merged_df_textual.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy & Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano� 16GB MP3 Player (8th Genera...
1,p036502,NV,Literacy & Language,Literacy,Most of my kindergarten students come from low...,Apple - iPod nano� 16GB MP3 Player (8th Genera...
2,p039565,GA,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Our elementary school is a culturally rich sch...,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
3,p233823,UT,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Hello;\r\nMy name is Mrs. Brotherton. I teach ...,3doodler Start Full Edu Bundle
4,p185307,NC,Health & Sports,Health & Wellness,My students are the greatest students but are ...,BALL PG 4'' POLY SET OF 6 COLORS BALL PLAYGROU...


In [167]:
merged_df_textual=merged_df_textual.drop_duplicates()

In [168]:
merged_df_textual.shape

(182080, 6)

In [169]:
merged_df_textual=merged_df_textual.sample(n=100000, random_state=1)
merged_df_textual.to_csv('merged_df_textual.csv',index=False)

# PART 3: Preprocess Text

Make an independent copy of the data so we can restart here when testing...

In [170]:
data = copy.copy(merged_df_textual).fillna(' ')  # when "merged" is the pandas dataframe

**PROBLEM**: Define a custom function `clean_punctuation()` to remove some punctuation from your text data. You don't have to do absolutely everything one might want to do - just show that you can do it. Start with each some easy operations with `str.replace()`.

In [171]:
# Define a custom function to clean punctuation from  given text

def clean_punctuation(txt):
    txt=txt.replace('&', ' ')
    txt=txt.replace('.', ' ')
    txt=txt.replace("\\r\\n", " ")
    return txt

**PROBLEM**: Use the `apply()` function from pandas to _apply_ that function down the `essays` column of your data.

In [172]:
# Apply your function to clean the essays column
for feature in ['essays']:
    data[feature]=data[feature].apply(clean_punctuation)
    
    
    
data.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
754205,p159955,VA,Math & Science,"Applied Sciences, Mathematics",We are a small Title I school Our school has ...,Amazon - Fire - 7'- Tablet - 16GB - Blue Amazo...
228971,p251390,CA,Health & Sports,"Health & Wellness, Team Sports",A new year and a new set of eager learners and...,12 Standard Scooter Boards with Handles Set o...
56487,p075249,OH,Health & Sports,Health & Wellness,I teach in a low income/high poverty area My...,Drive Medical Deluxe Folding Exercise Peddler ...
178898,p082333,FL,Applied Learning,"Character Education, Other","Gifted, doesn't mean perfect In fact Gifted ...","Celestial Seasonings Green Tea K-Cups, Authent..."
740661,p256298,NC,Music & The Arts,Visual Arts,As a teacher of a large group of our 3rd and ...,CANVAS MINI CLASSROOM SET OF 180 MARKER CRAYOL...


**PROBLEM**: Define **another** custom function called `clean_re()` to clean your text data using regular expressions. Do at least two "cleanings" (i.e., show that you can use the `re` library).

In [173]:
# Define a custom function to clean some given text
import re

def clean_re(txt):
    p = re.compile(r'[^\w\s]')
    txt=p.sub('', txt)
    
    return txt

In [174]:
# Apply clean_re() to all features
from tqdm import tqdm
for feature in tqdm(FEATURE_NAMES):
    data[feature]=data[feature].fillna('').astype(str).apply(clean_re)
    
    
data.head()


100%|██████████| 5/5 [00:05<00:00,  1.14s/it]


Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
754205,p159955,VA,Math Science,Applied Sciences Mathematics,We are a small Title I school Our school has ...,Amazon Fire 7 Tablet 16GB Blue Amazon Fir...
228971,p251390,CA,Health Sports,Health Wellness Team Sports,A new year and a new set of eager learners and...,12 Standard Scooter Boards with Handles Set o...
56487,p075249,OH,Health Sports,Health Wellness,I teach in a low incomehigh poverty area My ...,Drive Medical Deluxe Folding Exercise Peddler ...
178898,p082333,FL,Applied Learning,Character Education Other,Gifted doesnt mean perfect In fact Gifted St...,Celestial Seasonings Green Tea KCups Authentic...
740661,p256298,NC,Music The Arts,Visual Arts,As a teacher of a large group of our 3rd and ...,CANVAS MINI CLASSROOM SET OF 180 MARKER CRAYOL...


In [175]:
data.to_csv('clean_re.csv', index=False)    

data.shape

(100000, 6)

In [176]:
data=pd.read_csv('clean_re.csv',lineterminator='\n')

In [177]:
data.shape


(100000, 6)

In [178]:
data['description'].head(10)

0    Amazon  Fire  7 Tablet  16GB  Blue Amazon  Fir...
1    12 Standard Scooter Boards with Handles  Set o...
2    Drive Medical Deluxe Folding Exercise Peddler ...
3    Celestial Seasonings Green Tea KCups Authentic...
4    CANVAS MINI CLASSROOM SET OF 180 MARKER CRAYOL...
5       Apple  iPad mini 2 with WiFi  16GB  Space Gray
6    Apple  MacBook Air Latest Model  133Display  I...
7    Apple  MacBook Pro with Retina display Latest ...
8              Apple  iPad Air 2 WiFi 64GB  Space Gray
9    ASUS Chromebook C300SA 133 Inch Intel Celeron ...
Name: description, dtype: object

In [179]:
data.drop_duplicates().shape

(100000, 6)

**PROBLEM**: Remove stopwords. (Hint: use stopwords from nltk's `stopwords()` plus any additions you'd like to make. Then, again, define a custom function and then apply it to all features.)

In [180]:
stopWords.extend(['although','engaging','approximately','yet','nan','u','us','would','would','see','big','student','school','many'])


In [181]:
# Define custom function to remove stopwords
df = copy.copy(data)  
df=df.drop_duplicates()
def clean_stopword(txt):
    txt=txt.lower().split()
    filtered_sentence = [w for w in txt if  w  not in  stopWords]  
    filtered_sentence=' '.join(filtered_sentence)
    return filtered_sentence

In [182]:
# Apply function to remove stopwords  
from tqdm import tqdm
for feature in tqdm(FEATURE_NAMES):
    df[feature]=df[feature].fillna(' ').apply(clean_stopword)
    
    
df.to_csv('clean_stopword.csv', index=False)       
df.head(10)



100%|██████████| 5/5 [01:05<00:00, 13.16s/it]


Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,small title 75 population receiving freereduce...,amazon fire 7 tablet 16gb blue amazon fire 7ta...
1,p251390,ca,health sports,health wellness team sports,new year new set eager learners future athlete...,12 standard scooter boards handles set 6 ball ...
2,p075249,oh,health sports,health wellness,teach low incomehigh poverty area students lot...,drive medical deluxe folding exercise peddler ...
3,p082333,fl,applied learning,character education,gifted doesnt mean perfect fact gifted student...,celestial seasonings green tea kcups authentic...
4,p256298,nc,music arts,visual arts,teacher large group 3rd 4th students esl high ...,canvas mini classroom set 180 marker crayola f...
5,p018689,mo,literacy language math science,literature writing mathematics,classroom community important teacher lowincom...,apple ipad mini 2 wifi 16gb space gray
6,p195673,ny,special needs,special needs,students wonderful group children love coming ...,apple macbook air latest model 133display inte...
7,p087650,,literacy language math science,literature writing mathematics,second grade students benefit classroom laptop...,apple macbook pro retina display latest model ...
8,p255627,pa,literacy language,esl literature writing,students class hard workers creative motivated...,apple ipad air 2 wifi 64gb space gray
9,p016219,la,literacy language,literacy literature writing,heart effective technology integration technol...,asus chromebook c300sa 133 inch intel celeron ...


In [183]:
df=pd.read_csv('clean_stopword.csv',lineterminator='\n')

In [184]:
df[-10:]

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
99990,p046251,mn,health sports,health wellness,elementary urban neighborhood students new uni...,cd370x healthy kids cd library ff518 letamp821...
99991,p008166,fl,literacy language music arts,literacy performing arts,located high poverty area 100 students receive...,califone 2395irplc6 wireless infrared cassette...
99992,p178139,al,music arts,visual arts,art teacher small title 1 300 students 90 stud...,513041006 awt portable drying rack 10 x 18 100...
99993,p163380,,health sports,health wellness,ever struggled class meeting overwhelmed numbe...,gaiam restore balance cushion
99994,p019137,wi,literacy language,literacy,indoor recess today read answer enthusiastic y...,friend lakota incredible true story wolf brave...
99995,p097303,sc,health sports,team sports,students awesome come title funding physical e...,10man flag football set champion extreme tiedy...
99996,p059002,ny,math science special needs,applied sciences special needs,students energetic 10 yearolds vivid imaginati...,samsung chromebook 3 xe500c13k01us 2 gb ram 16...
99997,p236258,ca,literacy language,literacy literature writing,living underprivileged community odds students...,composition notebook 100 pages crayola llc col...
99998,p175626,tx,literacy language math science,literature writing mathematics,work title1 texas energetic first graders eage...,2 packs universal economy sheet protectors eco...
99999,p234679,,literacy language,literacy,teaching diverse group first graders without c...,br302bu backpatter8217s seat blue br302rd back...


In [185]:
df.shape

(100000, 6)

**PROBLEM**: Now use Gensim’s `simple_preprocess()` function to tokenize and clean up your text data. TIP: `simple_preprocess()` returns a list of words, so we want to wrap it with a function that joins the list back together into a string.

In [186]:
# Define custom function to wrap c from gensim
from gensim.utils import simple_preprocess
df2 = copy.copy(df) 
def simple_preprocess_custom(txt):
    txt=simple_preprocess(txt,deacc=True)
    txt=' '.join(txt)
    return txt

In [187]:
# Apply simple_preprocess() to all features

for feature in tqdm(FEATURE_NAMES):
    df2[feature]=df2[feature].fillna('').apply(simple_preprocess_custom)

100%|██████████| 5/5 [00:59<00:00, 11.87s/it]


In [188]:
df2.head()
df2.to_csv('simple_preprocess_custom.csv', index=False)       


In [189]:
df2=pd.read_csv('simple_preprocess_custom.csv',lineterminator='\n')


In [190]:
df2.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,small title population receiving freereduced l...,amazon fire tablet gb blue amazon fire tablet ...
1,p251390,ca,health sports,health wellness team sports,new year new set eager learners future athlete...,standard scooter boards handles set ball hop i...
2,p075249,oh,health sports,health wellness,teach low incomehigh poverty area students lot...,drive medical deluxe folding exercise peddler ...
3,p082333,fl,applied learning,character education,gifted doesnt mean perfect fact gifted student...,celestial seasonings green tea kcups authentic...
4,p256298,nc,music arts,visual arts,teacher large group rd th students esl high po...,canvas mini classroom set marker crayola fine ...


In [191]:
df2.shape

(100000, 6)

**PROBLEM**: Lemmatize the text. (Hint: Define a custom function and then apply it to all features.)

In [192]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ekaterinakryukova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [193]:
importlib.reload(nltk)
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from collections import Counter

In [319]:
df3=copy.copy(df2)



In [266]:
df3.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,small title population receiving freereduced lunch students come socioeconomic culturally diverse backgrounds rd grade total students th grade total students th grade total students students work hard achieve success provide best education students constantly looking new ways challenge engage childrenwe trying procure tablets give students access apps technology need utilize future want engage students stem curriculum early may reach full potential successful stem career environment students access technology trying provide since cannot afford technology seeking help provide opportunity students want students burn desire learn please help light fire future,amazon fire tablet gb blue amazon fire tablet gb black


In [233]:
# Write a lemmatization function based on nltk.stem.WordNetLemmatizer() -didn't complete for night


In [320]:
def get_pos( word ):
    w_synsets = wordnet.synsets(word)

    pos_counts = Counter()
    pos_counts["n"] = len(  [ item for item in w_synsets if item.pos()=="n"]  )
    pos_counts["v"] = len(  [ item for item in w_synsets if item.pos()=="v"]  )
    pos_counts["a"] = len(  [ item for item in w_synsets if item.pos()=="a"]  )
    pos_counts["r"] = len(  [ item for item in w_synsets if item.pos()=="r"]  )
    
    most_common_pos_list = pos_counts.most_common(3)
    return most_common_pos_list[0][0]

In [321]:

wnl = WordNetLemmatizer()
w_tokenizer = WhitespaceTokenizer()
def lemmatize_text(text):
    
    #lemmatize = lru_cache(maxsize=50000)(wnl.lemmatize)
    text=[wnl.lemmatize(w,get_pos(w)) for w in w_tokenizer.tokenize(text)]
    return text

In [322]:
# Apply lemmatize_text() to all features  
#As it is impossible to process all data, i will process 500 000
from tqdm import tqdm


from tqdm._tqdm_notebook import tqdm_notebook

for feature in tqdm(['essays', 'description']):
    tqdm_notebook.pandas()
    df3[feature]=df3[feature].fillna(' ').progress_apply(lemmatize_text)


  0%|          | 0/2 [00:00<?, ?it/s]

HBox(children=(FloatProgress(value=0.0, max=100000.0), HTML(value='')))

 50%|█████     | 1/2 [43:05<43:05, 2585.47s/it]




HBox(children=(FloatProgress(value=0.0, max=100000.0), HTML(value='')))

100%|██████████| 2/2 [45:33<00:00, 1366.70s/it]







In [324]:
df3.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,"[small, title, population, receive, freereduced, lunch, student, come, socioeconomic, culturally, diverse, background, rd, grade, total, student, th, grade, total, student, th, grade, total, student, student, work, hard, achieve, success, provide, best, education, student, constantly, look, new, way, challenge, engage, childrenwe, try, procure, tablet, give, student, access, apps, technology, need, utilize, future, want, engage, student, stem, curriculum, early, may, reach, full, potential, successful, stem, career, environment, student, access, technology, try, provide, since, cannot, afford, technology, seek, help, provide, opportunity, student, want, student, burn, desire, learn, please, help, light, fire, future]","[amazon, fire, tablet, gb, blue, amazon, fire, tablet, gb, black]"


In [325]:
df3.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,"[small, title, population, receive, freereduced, lunch, student, come, socioeconomic, culturally, diverse, background, rd, grade, total, student, th, grade, total, student, th, grade, total, student, student, work, hard, achieve, success, provide, best, education, student, constantly, look, new, way, challenge, engage, childrenwe, try, procure, tablet, give, student, access, apps, technology, need, utilize, future, want, engage, student, stem, curriculum, early, may, reach, full, potential, successful, stem, career, environment, student, access, technology, try, provide, since, cannot, afford, technology, seek, help, provide, opportunity, student, want, student, burn, desire, learn, please, help, light, fire, future]","[amazon, fire, tablet, gb, blue, amazon, fire, tablet, gb, black]"


In [368]:
df33=copy.copy(df3)

In [447]:
stopWords.extend(['learn', 'classroom','help','need','work','th','come','class','love','able','year','time','want','make'])


In [448]:
#AFTER LEMMATIZATION IT IS NEEDED TO REMOVE STTOPWORDS
from tqdm import tqdm
for feature in tqdm(['essays', 'description']):
    df33[feature]=df33[feature].fillna(' ').apply(lambda w: [i for i in w if  i  not in  stopWords])

100%|██████████| 2/2 [00:45<00:00, 22.88s/it]


In [371]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
df33.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p159955,va,math science,applied sciences mathematics,"[small, title, population, receive, freereduced, lunch, come, socioeconomic, culturally, diverse, background, rd, grade, total, th, grade, total, th, grade, total, work, hard, achieve, success, provide, best, education, constantly, look, new, way, challenge, engage, childrenwe, try, procure, tablet, give, access, apps, technology, need, utilize, future, want, engage, stem, curriculum, early, may, reach, full, potential, successful, stem, career, environment, access, technology, try, provide, since, cannot, afford, technology, seek, help, provide, opportunity, want, burn, desire, please, help, light, fire, future]","[amazon, fire, tablet, gb, blue, amazon, fire, tablet, gb, black]"


In [372]:
df33.to_csv('lemmatize_text.csv', index=False)  

**PROBLEM**: What happened to the data in the pandas dataframe>

ANSWER: It was converted from long text into a list of individual words.

# PART 4:  Make an LDA topic model for the ESSAYS.

**Note: Part 4 is worth 10 points (the value of 10 individual problems).**

Define an LDA topic model for the `essays`. Compute the "Coherence score." Visually inspect the topic model by inspecting the top keywords from each model. Gensim provides functions for all of these tasks.  

In [449]:

# Create Dictionary
id2word = corpora.Dictionary( df33['essays'])

# Create Corpus
texts =  df33['essays']
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View unique id for each word in the essay
print(corpus[:1])


[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 2), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 3), (50, 1), (51, 3), (52, 2), (53, 1), (54, 1)]]


In [450]:
# View word a given id corresponds to for id=0
id2word[0]

'access'

In [451]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('access', 2),
  ('achieve', 1),
  ('afford', 1),
  ('apps', 1),
  ('background', 1),
  ('best', 1),
  ('burn', 1),
  ('cannot', 1),
  ('career', 1),
  ('challenge', 1),
  ('childrenwe', 1),
  ('constantly', 1),
  ('culturally', 1),
  ('curriculum', 1),
  ('desire', 1),
  ('diverse', 1),
  ('early', 1),
  ('education', 1),
  ('engage', 2),
  ('environment', 1),
  ('fire', 1),
  ('freereduced', 1),
  ('full', 1),
  ('future', 2),
  ('give', 1),
  ('grade', 3),
  ('hard', 1),
  ('light', 1),
  ('look', 1),
  ('lunch', 1),
  ('may', 1),
  ('new', 1),
  ('opportunity', 1),
  ('please', 1),
  ('population', 1),
  ('potential', 1),
  ('procure', 1),
  ('provide', 3),
  ('rd', 1),
  ('reach', 1),
  ('receive', 1),
  ('seek', 1),
  ('since', 1),
  ('small', 1),
  ('socioeconomic', 1),
  ('stem', 2),
  ('success', 1),
  ('successful', 1),
  ('tablet', 1),
  ('technology', 3),
  ('title', 1),
  ('total', 3),
  ('try', 2),
  ('utilize', 1),
  ('way', 1)]]

In [452]:
#df.project_subject_categories.value_counts()

In [453]:
#I have tried 10 and 7 topics -> the best coherence score is for 5
#freeze_support()
    ##
    ##z Build Multicore LDA
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics = 5, id2word=id2word)
    # Saving trained model
    #lda_multicore_model.save('LDA_NYT_multicore')


In [454]:
lda_multicore_model.save('LDA_NYT_multicore7')

In [455]:
# Loading trained model
lda_multicore_model = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore7')


In [456]:
# Print the Keyword in the 7 topics
print(lda_multicore_model.print_topics())
doc_lda = lda_multicore_model[corpus]

[(0, '0.009*"new" + 0.008*"technology" + 0.007*"experience" + 0.007*"day" + 0.007*"material" + 0.006*"science" + 0.006*"provide" + 0.006*"math" + 0.006*"create" + 0.006*"project"'), (1, '0.024*"read" + 0.023*"book" + 0.008*"day" + 0.007*"write" + 0.007*"new" + 0.006*"child" + 0.005*"skill" + 0.005*"material" + 0.005*"well" + 0.005*"grade"'), (2, '0.009*"get" + 0.008*"child" + 0.008*"skill" + 0.007*"day" + 0.007*"play" + 0.006*"way" + 0.006*"provide" + 0.005*"life" + 0.005*"read" + 0.005*"one"'), (3, '0.020*"read" + 0.010*"book" + 0.009*"technology" + 0.008*"project" + 0.007*"also" + 0.007*"skill" + 0.007*"grade" + 0.007*"math" + 0.006*"allow" + 0.006*"day"'), (4, '0.011*"seat" + 0.008*"provide" + 0.008*"allow" + 0.008*"create" + 0.007*"art" + 0.007*"day" + 0.007*"opportunity" + 0.007*"project" + 0.007*"sit" + 0.006*"give"')]


In [457]:
# Compute Perplexity
#print('\nPerplexity: ', lda_multicore_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.2879724933532581


If you use gensim and the following three variables, then you can visualize topics & keywords with the code below.

    lda_model:    this is an LDA model generated by gensim.models.ldamodel.LdaModel()
    id2word:      this is the dictionary term IDs from corpora.Dictionary()
    corpus:       this is the collection of "documents"


In [458]:
# Visualize topics-keywords
lda_model=lda_multicore_model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

# PART 5:  Make an LDA topic model for the DESCRIPTIONS.

**Note: Part 5 is worth 5 points (the value of 5 individual problems).**

Using the same K (and any other hyperparameters from Part 4), recompute a model for Descriptions. Compare the two sets of results. Do they vary? How? Why? Explain what you find. 

In [194]:
df3['description'].head()

0                                         ['gb', 'tb']
1    ['insignia', 'overtheear', 'wireless', 'headph...
2                                         ['gb', 'tb']
3    ['apple', 'ipad', 'air', 'mh', 'lla', 'inch', ...
4    ['xyzprinting', 'rfplcxus', 'da', 'vinci', 'ju...
Name: description, dtype: object

In [196]:

# Create Dictionary
id2word_description = corpora.Dictionary([d.split() for d in df3['description']])

# Create Corpus
texts_description = [d.split() for d in df3['description']]
# Term Document Frequency
corpus_description = [id2word_description.doc2bow(text) for text in texts_description]

# View unique id for each word in the essay
print(corpus_description[:1])


[[(0, 1), (1, 1)]]


In [197]:
# 
import time
start_time = time.time()
##
## Build Multicore LDA
lda_multicore_model_d = gensim.models.ldamulticore.LdaMulticore(corpus_description, num_topics = 5, id2word=id2word_description)
# Saving trained model
lda_multicore_model_d.save('LDA_NYT_multicore_d2')
# Loading trained model
lda_multicore_model_d = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore_d2')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))




--- 277.0150249004364 seconds ---


In [198]:
# Print the Keyword in the 7 topics
print(lda_multicore_model_d.print_topics())
doc_lda_d = lda_multicore_model_d[corpus_description]

[(0, '0.038*"\'set\'," + 0.012*"\'bk\'," + 0.011*"\'game\'," + 0.009*"\'book\'," + 0.009*"\'learn\'," + 0.009*"\'math\'," + 0.007*"\'grade\'," + 0.006*"\'card\'," + 0.005*"\'pp\'," + 0.005*"\'read\',"'), (1, '0.020*"\'pack\'," + 0.016*"\'color\'," + 0.012*"\'set\'," + 0.012*"\'inch\'," + 0.010*"\'black\'," + 0.009*"\'assort\'," + 0.009*"\'kid\'," + 0.007*"\'ball\'," + 0.007*"\'blue\'," + 0.006*"\'marker\',"'), (2, '0.033*"\'book\'," + 0.015*"[\'gb\'," + 0.011*"\'reader\'," + 0.011*"\'level\'," + 0.009*"\'story\'," + 0.009*"\'read\'," + 0.008*"\'microsd\'," + 0.008*"\'elite\']" + 0.007*"\'tb\']" + 0.007*"\'kid\',"'), (3, '0.046*"\'gb\'," + 0.026*"\'set\'," + 0.019*"\'android\']" + 0.019*"[\'ip\'," + 0.017*"\'pack\'," + 0.016*"\'paper\'," + 0.014*"\'oz\'," + 0.013*"\'color\'," + 0.011*"\'construction\'," + 0.007*"\'amp\',"'), (4, '0.132*"\'headphone\'," + 0.130*"\'black\']" + 0.130*"\'wireless\'," + 0.127*"\'overtheear\'," + 0.127*"[\'insignia\'," + 0.026*"\'spanish\'," + 0.021*"\'editio

In [199]:
# Compute Perplexity
#print('\nPerplexity: ', lda_multicore_model_d.log_perplexity(corpus_description))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model_d, texts=texts_description, dictionary=id2word_description, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5647555558668073


In [200]:
# Visualize topics-keywords
lda_model=lda_multicore_model_d
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus_description, id2word_description)
vis

# PART 6:  Use TextHero and help to improve it.

**Note: This is worth 5 points (the value of 5 individual problems).**

[TextHero](https://texthero.org/) is an opensource project developed by a student from the [TIS Lab of Prof. Younge](www.epfl.ch/labs/tis). Go to the [GIT repository for TextHero](https://github.com/jbesomi/texthero), install the package, review the documentation, and if you are impressed by the package - give it a star and tell others! (Not required)

Once you understand TextHero, then use the package to re-implement major portions of Part 3 of this assignment that you completed above.  

In [201]:

!pip install texthero





In [202]:
import importlib
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

In [203]:
import texthero as hero


In [224]:
data2=copy.copy(merged_df_textual)

In [225]:
data2=data2.sample(n=100000, random_state=1)

In [226]:
data2.head(4)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
1014941,p245721,CT,"Literacy & Language, Music & The Arts","Literacy, Visual Arts",It is a new year and we have a new attitude --...,Binney & Smith Crayola Classpack Colored Penci...
727382,p023718,CO,Literacy & Language,Literacy,"I am a teacher at a school in Aurora, Colorado...",LL610X - Nonfiction Leveled Books Classroom Li...
846546,p145134,UT,Health & Sports,"Health & Wellness, Nutrition Education","I teach at a charter school full of bright, en...",Annie's Homegrown Organic Vegan Fruit Snacks V...
325397,p147383,CA,"Math & Science, Applied Learning","Applied Sciences, College & Career Prep",I want my students to have access to the techn...,Insignia? - 10.1'- Tablet - 32GB Insignia? - F...


In [227]:
from tqdm import tqdm

from tqdm._tqdm_notebook import tqdm_notebook
for column in tqdm(data2.columns):
    tqdm_notebook.pandas()
    data2[column] = data2[column].pipe(hero.clean)

100%|██████████| 6/6 [01:12<00:00, 12.08s/it]


In [228]:
data2.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
1014941,p245721,ct,literacy language music arts,literacy visual arts,new year new attitude want succeed gotten know...,binney smith crayola classpack colored pencils...
727382,p023718,co,literacy language,literacy,teacher school aurora colorado school expediti...,ll610x nonfiction leveled books classroom libr...
846546,p145134,ut,health sports,health wellness nutrition education,teach charter school full bright energetic cur...,annie homegrown organic vegan fruit snacks var...
325397,p147383,ca,math science applied learning,applied sciences college career prep,want students access technology available teac...,insignia tablet 32gb insignia flexview folio c...
215153,p216238,nj,literacy language,literacy,students come loving homes inner city english ...,ar802 interactive language notebook reproducib...


In [230]:
data2.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
1014941,p245721,ct,literacy language music arts,literacy visual arts,new year new attitude want succeed gotten know...,binney smith crayola classpack colored pencils...
727382,p023718,co,literacy language,literacy,teacher school aurora colorado school expediti...,ll610x nonfiction leveled books classroom libr...
846546,p145134,ut,health sports,health wellness nutrition education,teach charter school full bright energetic cur...,annie homegrown organic vegan fruit snacks var...
325397,p147383,ca,math science applied learning,applied sciences college career prep,want students access technology available teac...,insignia tablet 32gb insignia flexview folio c...
215153,p216238,nj,literacy language,literacy,students come loving homes inner city english ...,ar802 interactive language notebook reproducib...


In [None]:
#kernel died even though i tried 5 ttimes
"""data2['tfidf'] = (
    hero.tfidf(data2['essays']))"""

In [232]:
#kernel died even though i tried 5 ttimes
"""
data2['pca'] = hero.pca(data2['tfidf'])
hero.scatterplot(
    df, 
    col='pca', 
    color='topic', 
    title="PCA BBC Sport news"
)"""

'\ndata2[\'pca\'] = hero.pca(data2[\'tfidf\'])\nhero.scatterplot(\n    df, \n    col=\'pca\', \n    color=\'topic\', \n    title="PCA BBC Sport news"\n)'

**Note: This is worth 5 points (the value of 5 individual problems).**

OpenSourcve packages rely on the community of users to help them grow and improve. Review the [contributing file](https://github.com/jbesomi/texthero/blob/master/CONTRIBUTING.md) for Text Hero and then identify a portion of the documentation that you feel could be improves. Edit/write 1 paragraph of documentation for the package that you believe would improve it. Copy that paragraph in below (into this notebook) so that it can be graded. And - if you think your contribution would truly help the project, please learn how to use git to suggest the change (a pull request) to the manager of the repository (Jonathan Besomi). 

Documentation for nmf

texthero.representation.nmf

nmf(s, n_components=2)

    Perform non-negative matrix factorization.

Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. 
This factorization can be used for example for dimensionality reduction, source separation or topic extraction.

Parameters

    s: Pandas Series
    n_components: Int. Default is 2.
        Number of components to keep. If n_components is not set or None, all components are kept.

Examples

import texthero as hero

import pandas as pd

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]
s= hero.clean(s, custom_pipeline)
s_tf_idf = hero.tfidf(s)
pca=hero.nmf(s_tf_idf)

Documentation for pca

Example:
    
import texthero as hero

import pandas as pd

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]
                   
s= hero.clean(s, custom_pipeline)

s_tf_idf = hero.tfidf(s)

pca=hero.pca(s_tf_idf)
