# DSFB Assignment 5

In this assignment, you will begin to work with text data and natural language processing. You will analyze aspects of the DonorsChoose.org program. Aspects of this project were first posed as a Kaggle challenge, and the data come from the [Kaggle DonorsChoose.org Application Screening challenge](https://www.kaggle.com/c/donorschoose-application-screening/data). We have changed what you need to do in this assignment (so it does not track what was done in the Kaggle Challenge), but using or referring to the Kaggle Challenge repository is nevertheless not allowed for the assignment.

###  DonorsChoose.org  
  
Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. In this assignment, you will analyze the text of the essays and requirements from each proposal.

<img src="https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg" width="500" height="500" align="center"/>

Image source: https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg

### Data

As you will see, this dataset includes many different kinds of features, both structured and unstructured. The dataset consists of application materials (see *application_data.csv*) and resources requested (see *resource_data.csv*). The application materials contain the following features.

| Feature name  | Description  |
|----------------|--------------|
| id  | Unique id of the project application    |
| teacher_id    | id of the teacher submitting the application  |
| teacher_prefix    | title of the teacher's name (Ms., Mr., etc.)    |
| school_state    | US state of the teacher's school    |
| project_submitted_datetime    | application submission timestamp    |
| project_grade_category    | school grade levels (PreK-2, 3-5, 6-8, and 9-12)   |
| project_subject_categories   | category of the project (e.g., "Music & The Arts")    |
| project_subject_subcategories    | sub-category of the project (e.g., "Visual Arts")    |
| project_title    | title of the project    |
| project_essay_1    | first essay*   |
| project_essay_2    | second essay*    |
| project_essay_3    | third essay*   |
| project_essay_4    | fourth essay*  |
| project_resource_summary    | summary of the resources needed for the project    |
| teacher_number_of_previously_posted_projects   | number of previously posted applications by the submitting teacher    |
| project_is_approved    | whether DonorsChoose proposal was accepted (0="rejected", 1="accepted"); train.csv only    |


\*Note: Prior to May 17, 2016, the prompts for the essays were as follows:

  * project_essay_1: "Introduce us to your classroom"  

  * project_essay_2: "Tell us more about your students"  

  * project_essay_3: "Describe how your students will use the materials you're requesting"  

  * project_essay_4: "Close by sharing why your project will make a difference"  

Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:

  * project_essay_1: "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."  

  * project_essay_2: "About your project: How will these materials make a difference in your students' learning and improve their school lives?"  

For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be missing (i.e. NaN).
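
The cutoff rule above can be checked directly in pandas. The following is a minimal sketch on a tiny hand-made frame: the three rows are invented for illustration and are not taken from the dataset, and only the column names come from the table above.

```python
import pandas as pd

# Toy frame: invented rows that mimic the schema described above.
toy = pd.DataFrame({
    "project_submitted_datetime": pd.to_datetime(
        ["2016-04-27 09:58:04", "2016-05-17 00:00:00", "2017-01-01 22:57:44"]
    ),
    "project_essay_3": ["third essay", None, None],
    "project_essay_4": ["fourth essay", None, None],
})

CUTOFF = pd.Timestamp("2016-05-17")

# Label each row by whether it falls before or on/after the prompt change.
period = (toy["project_submitted_datetime"] >= CUTOFF).map(
    {False: "before", True: "on_or_after"}
)

# Count missing late essays separately for each period.
missing_by_period = (
    toy[["project_essay_3", "project_essay_4"]].isna().groupby(period).sum()
)
print(missing_by_period)
```

On the real data, the same grouping should show missing values for essays 3 and 4 only in the `on_or_after` group if the rule holds.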


### Special NLP Libraries

We will use several new libraries for this assignment, so be sure to install them on your machine first with `pip` in a terminal:

    pip install --user -U nltk
    pip install -U gensim
    pip install -U spacy
    pip install -U pyldavis
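
Before running the imports, it can help to confirm that the installs succeeded. The sketch below uses only the standard library. It assumes the import names `nltk`, `gensim`, `spacy`, and `pyLDAvis`, which are the names used in the imports cell of this notebook. The helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Import names for the packages installed above (pyldavis is imported as pyLDAvis).
REQUIRED = ["nltk", "gensim", "spacy", "pyLDAvis"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing:", missing_packages(REQUIRED) or "none")
```

If anything is reported missing, rerun the corresponding `pip install` line in the same environment that the notebook kernel uses.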

## IMPORTS

In [467]:
import importlib
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

# PART 1: Prep

**PROBLEM**: To use a particular model in the `spacy` package, you need to manually download and install that particular model. You will need to run the following code from a terminal: `python -m spacy download en_core_web_sm`. Rather than doing that manually from bash in a separate terminal program, do it inline below using a "magic" command in jupyter. HINT: Use *!* followed by a bash command in a cell to run a bash command.

In [468]:
# Download en_core_web_sm for spacy

!python3 -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


**PROBLEM**: To confirm that `spacy` is working (and `en_core_web_sm` is installed on your computer), you should be able to use `spacy.load()` to build a `Language` object to perform some basic nlp. Do that below:

In [469]:
# Test use of spacy by using the spacy.load() function
import spacy
import en_core_web_sm
nlp = spacy.load('en_core_web_sm')

**PROBLEM**: Use nltk.download() to download a list of raw stopwords. (see NLTK documentation)

In [470]:
# Download NLTK stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ekaterinakryukova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**PROBLEM**: Use the `stopwords` object from `nltk` to build a list of English stopwords. 

In [471]:
# Get English Stopwords from NLTK
from nltk.corpus import stopwords
stopWords = stopwords.words('english')

In [472]:
print(len(stopWords))

179


**PROBLEM**: Extend your `stop_words` list with some additional stopwords that you believe should be ignored in this particular context.

In [473]:
# Extend the stop word list  

stopWords.extend(['from', 'subject', 're', 'edu', 'use'])

print(len(stopWords))

184
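
To illustrate what the extended list is for, here is a minimal sketch of stopword filtering. It uses a short stand-in list rather than the NLTK list so that it runs without any downloads. The helper `remove_stopwords` and the sample sentence are invented for this example.

```python
# Stand-in stopword list (an assumption for illustration); in the notebook
# this would be the extended `stopWords` list built above.
demo_stop_words = ["the", "to", "my", "from", "subject", "re", "edu", "use"]

def remove_stopwords(tokens, stop_words):
    """Drop tokens that appear in stop_words, comparing in lower case."""
    stop_set = set(stop_words)  # set membership is O(1) per token
    return [tok for tok in tokens if tok.lower() not in stop_set]

tokens = "My students use the tablets to read".split()
print(remove_stopwords(tokens, demo_stop_words))
# prints ['students', 'tablets', 'read']
```

Converting the list to a set also removes duplicates, such as 'from', which is already in the NLTK English list.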


### Download the Data

Unlike other projects, this project includes a training set that is too big for GitHub. Using the terminal in JupyterLab, download the data with the *wget* command, unzip it with the *unzip* command, and check that it is in the root directory of the project.

Locations : 

    Applications dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip
    Resources dataset: https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip
    
Hint: Use *wget* and *unzip* commands. Use *!* followed by a bash command in a cell to run a bash command.

**PROBLEM**: wget the data

In [474]:
# wget the data
import wget
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip','data')

'data/application_data.csv (1).zip'

In [475]:
wget.download('https://storage.googleapis.com/dsfm-datasets/text-applications/resource_data.csv.zip','data')

'data/resource_data.csv (1).zip'
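
The `(1)` suffix in the outputs above suggests that an archive with the same name was already present in `data`, so each call saved a second copy. A minimal sketch of a download that skips files that already exist is shown below. It uses `urllib.request.urlretrieve` from the standard library instead of the `wget` package, and the helper name `download_once` and its `fetch` parameter are assumptions made for this example.

```python
import os
from urllib.request import urlretrieve

def download_once(url, dest_dir, fetch=urlretrieve):
    """Download url into dest_dir unless the file is already there."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(target):
        fetch(url, target)  # only hit the network when needed
    return target

# Example call (commented out so the sketch does not download anything):
# download_once(
#     "https://storage.googleapis.com/dsfm-datasets/text-applications/application_data.csv.zip",
#     "data",
# )
```

The returned path is always the plain file name, so the later unzip step can rely on it regardless of how many times the cell is rerun.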

**PROBLEM**: unzip the data

In [476]:
# unzip the data
from zipfile import ZipFile
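# Use a context manager so the archive is closed, and avoid shadowing the built-in zip()
with ZipFile('data/application_data.csv.zip') as archive:
    archive.extractall('data/application_data')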

In [477]:
# Same pattern for the resources archive
with ZipFile('data/resource_data.csv.zip') as archive:
    archive.extractall('data/resource_data')


# PART 2: Load Data

**PROBLEM**: Load `application_data.csv` and investigate it a bit.

In [478]:
# Load applications
application_data = pd.read_csv('data/application_data/application_data.csv',parse_dates=['project_submitted_datetime'])
application_data.head(5)


Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,"Most of my kindergarten students come from low-income households and are considered \""at-risk\"". These kids walk to school alongside their parents and most have never been further than walking distance from their house. For 80% of my students, English is not their first language or the language spoken at home. \r\n\r\nWhile my kindergarten kids have many obstacles in front of them, they come to school each day excited and ready to learn. Most students started the year out never being in a school setting. At the start of the year many had never been exposed to letters. Each day they soak up more knowledge and try their hardest to succeed. They are highly motivated to learn new things every day. We are halfway through the year and they are starting to take off. They know know all letters, some sight words, numbers to 20, and a majority of their letter sounds because of their hard work and determination. I am excited to see the places we will go from here!",I currently have a differentiated sight word center that we do daily during our literacy stations. The students have activities that relate to whatever sight word list they are on. This is one of their favorite station activities. I want to continue to provide the students with engaging ways to practice their sight words. \r\n\r\nI dream of having the students use QR readers to scan the sight words that they are struggling with and the Ipods reading the sight words with them. This would help so many of my students by giving them multiple exposures to the words. My students need someone who can go over these sight words daily and I can't always get around to everyone to practice their flashcards with them. 
With the Ipods they would still have a way to practice their sight words on a daily basis.,,,My students need 6 Ipod Nano's to create and differentiated and engaging way to practice sight words during a literacy station.,26,1
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,"Our elementary school is a culturally rich school, with a diverse population of 580 students, in Pre-K through sixth grade.\r\nOur Title I school population has 92% of students qualifying for free or reduced priced lunches and a high concentration of English Learners. We also serve two foster group homes for temporary and long-term placement of homeless children. \r\nWe do not see these statistics as road blocks. We see them as additions to our rich diversity. Together we will help students to develop to their fullest potential: Creative, problem-solving, compassionate adults.","We strive to provide our diverse population of students with not only extra curricular activities, but an outlet for them to express themselves creatively.As a teacher, I have organized a dance club for lower elementary that meets once a week after school.\r\n\r\n This gives the girls something to look forward to, fosters the education of the whole child, and creates a social environment for our varied cultural student body. \r\n\r\nSince beginning our dance club, I have watched our girls who normally are introverts, bloom with excitement. They are also choreographing dances and productions in content areas during the school day.",,,My students need matching shirts to wear for dance performances and competitions.,1,0
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach 5th grade at Ascent Academy in Utah. We are a wonderful charter school that uses the students' interests to help them learn. We are always looking for wonderful teaching methods to help students. \r\nMy students are wonderful. I have several levels in my class. Our school is big on curriculum compacting and helping gifted kids move faster and struggling kids get the extra help they need. We would benefit so much from your donation. Every little bit helps us reach our goal. I teach Science to all 88 5th grade students and have clusters of 25 new students every 8 weeks. Your donation will help more than just one 5th grade class it will help several students not to mention the students in years to come.\r\n,We are looking to add some 3Doodler to our classroom. It would be wonderful to have our own set of 3Doodler. In order to help our students achieve our mission. Our school is big on using technology to help students express what they have learned through a medium of their choice. Having these 3Doodler in our class my fast and advanced learners will be able to go ahead and start the project while I help the other students get their. The 3Doodler will also allow each one of my students find a medium to help them learn and retain the knowledge.,,,My students need the 3doodler. We are an SEM school which means students learn using a profile that tells them what way they learn best. Having this 3doodler will help my students learn.,5,1
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activities and Gain Better Health","My students are the greatest students but are socially and economically disadvantaged. We are an inner city school being limited to doing all activities in (PE) Physical Education inside because we have violence at the location where our school is located. All the physical activities the students are active with are within the school so we have to have a good program.\r\n\r\nMost of the students are either African American or Hispanic. The students range from being enrolled as a kindergarten-8th grade. Since Physical Education is important with one's success in school; all the students have PE class Monday through Friday every day. The proper equipment in PE is not always possible so this is why we are here asking for your help with shelving.\r\n\r\nThis project is \""kid-inspired\"" in that they want better fitness. They look back at their PE class and said that is only 30 minutes each day. They go to an extended day school year-round school which after 3:00 in the afternoon they have 30 minutes for more physical activity. They said they need more equipment to be active with. The students being \""kid inspired\"" want a variety of equipment for life changing physical activity.","The student's project which is totally \""kid-inspired\""decided they needed more equipment to keep the them active and gaining better health. This \""kid-inspired\"", they realized they needed 60 minutes of physical activity in school, and they are going to make it happen. They get 30 minutes in Pe daily so they have to come up with 30 minutes more. They are not getting it when they go home because their extended day at school. The students said they need more equipment because they do not have what is needed in school presently. 
The variety of activity equipment,a large variety of balls, and a parachute is what the students are asking for. This equipment will increase peer-to-peer learning among students. This is very important building leaders within our school and community.\r\nIncreasing activities in all the students will help each student have better health with 60 minutes of daily physical activity. The physical activity is what all the students are talking about. Getting more!\r\nThe peer-to-peer learning building leaders and happiness within our school being everyone is active. This activity will only be possible as \""kid-inspired\"" with the equipment they are asking for.",,,"My students need balls and other activity equipment to meet the needs of this \""kid Inspired\"" project for them to increase their physical activity. The different balls, parachute, and activity equipment will do it increasing their.",16,0
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,"My students are athletes and students who are interested in health and physical activity. In my elective class my students have a garden in which we grow our own food and make healthy meals within the kitchen. My students love cooking from scratch and being creative with their meals. Most of my students don't know how to cook from scratch, but they learn all of the basics in my classroom. If you don't have health, nothing else seems to matter.","For some reason in our kitchen the water comes out from the faucet white looking and not clear like most water that comes from the faucet. We are not exactly sure why that is, but the students are wary of using that water for their cooking or drinking. After much online research we feel that the Berkey Water Filtration system would be an ideal solution to our water problem. Although the water in Compton is probably fine to drink, it would be better for us to err on the side of caution and have our water filtered before we cook with it or drink it.",,,My students need a water filtration system for our culinary arts class.,42,1


In [479]:
application_data.shape

(182080, 16)

In [480]:
#type of date column
application_data.project_submitted_datetime.dtypes

dtype('<M8[ns]')

In [481]:
#number of nan values
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 175706, 175706)

**PROBLEM**: Load `resource_data.csv` and investigate it a bit.

In [482]:
# Load resources

resource_data = pd.read_csv('data/resource_data/resource_data.csv')
resource_data.head(5)

Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Box of 96, Yellow (13872)",2,13.59
4,p069063,"EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS (TRANQUIL BLUE), SET OF 4",3,24.95


**PROBLEM**: Some of the essays are NA. Replace NAs with empty strings.

In [483]:
# Replace NA values in essay columns with ''

# fillna('') is the direct way to replace NaN; apply it to all four essay columns
essay_cols = [f'project_essay_{i}' for i in range(1, 5)]
application_data[essay_cols] = application_data[essay_cols].fillna('')

In [484]:
#count nan
application_data.project_essay_1.isna().sum(),application_data.project_essay_2.isna().sum(),application_data.project_essay_3.isna().sum(),application_data.project_essay_4.isna().sum(),

(0, 0, 0, 0)

In [485]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_essay_1', 'project_essay_2',
       'project_essay_3', 'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved'],
      dtype='object')

**PROBLEM**: To simplify matters, combine all essays into just one feature called "essays"

In [486]:
# Combine essays
application_data['essays']=application_data['project_essay_{}'.format(1)]
for i in range(2,5):
    application_data['essays']+=application_data['project_essay_{}'.format(i)].astype(str)
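
Note that the loop above concatenates the essays with no separator, so the last word of one essay runs into the first word of the next (for example `they.My`). If that matters for tokenization, one option is to join with a space instead. The sketch below shows this on a toy frame with invented text, and it assumes that missing essays have already been replaced by empty strings.

```python
import pandas as pd

# Toy frame with invented text; the real notebook would use application_data.
toy = pd.DataFrame({
    "project_essay_1": ["First essay.", "Only first."],
    "project_essay_2": ["Second essay.", "And second."],
    "project_essay_3": ["Third essay.", ""],
    "project_essay_4": ["Fourth essay.", ""],
})

essay_cols = [f"project_essay_{i}" for i in range(1, 5)]

# Join with a space, then trim the trailing spaces left by empty essays.
toy["essays"] = toy[essay_cols].agg(" ".join, axis=1).str.strip()
print(toy["essays"].tolist())
```

This is an alternative to the cell above, not what produced the outputs shown in this notebook.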

In [487]:
#get data with all essays
application_data[application_data.project_submitted_datetime<'2016-05-17'].head(1)

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,essays
18,p232007,e7a8f866e3174a77ffe37323f032a8ac,Mrs.,FL,2016-04-27 09:58:04,Grades PreK-2,"Applied Learning, Literacy & Language","College & Career Prep, Literature & Writing",Watch Readers Grow!,"During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.",My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.,"During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing, word work (reading skills). If the class is actively engaged this leave more time for small group learning and conferencing with students. The giant magnet words will help at word work to build sentences and use the correct parts of speech. The classroom carpet will be for read to self where students can cozy up with a good book. Read with a pen close reads will be used during buddy read where students can zoom into the meanings and what is being read. The language skill center and quick picks will be used for our writing center and will help students with the skills they need to be great writers. All these materials will help my students grow doing our daily readers workshop.",Your donations would greatly be a blessing for all my students. They focus on all the daily 5 rotations as well as help them grow. 
These reading activities will keep students engaged so the teacher can work and help students master the skills they are week in. The reading genre carpet will be used to enjoy a cozy place to sit and read. The close reads will be at buddy read so that students can help each other focus on what is being read. Thank you in advance for helping them grow as reader.,My students need these reading materials to help them learn in a fun and motivating way. With this reading genre carpet and reading and writing materials they will grow.,6,1,"During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing, word work (reading skills). If the class is actively engaged this leave more time for small group learning and conferencing with students. The giant magnet words will help at word work to build sentences and use the correct parts of speech. The classroom carpet will be for read to self where students can cozy up with a good book. Read with a pen close reads will be used during buddy read where students can zoom into the meanings and what is being read. 
The language skill center and quick picks will be used for our writing center and will help students with the skills they need to be great writers. All these materials will help my students grow doing our daily readers workshop.Your donations would greatly be a blessing for all my students. They focus on all the daily 5 rotations as well as help them grow. These reading activities will keep students engaged so the teacher can work and help students master the skills they are week in. The reading genre carpet will be used to enjoy a cozy place to sit and read. The close reads will be at buddy read so that students can help each other focus on what is being read. Thank you in advance for helping them grow as reader."


In [488]:
#check
application_data['essays'][18]

'During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged this

In [489]:
for i in range(1,5):
    print(application_data['project_essay_{}'.format(i)][18])

During our reading workshop students are at daily 5. My students need activities to help them practice skills in a fun and enjoyable way that is on the level of each child. As the teacher I enjoy conferencing with each student, so the more engaged the students are practicing the skills they.
My students lack confidence. I have a class with such great potential. My students need more hands on learning and extra practice to catch and grow a love for reading. My second graders love to learn. We have resources but most are out dated. My students would be so bright if they could only build confidence. I believe in them and now I want to see them bloom.
During our reading workshop. We do mini lessons that focus on a skill and then they rotate through daily 5 (centers). These reading activities will help them apply what they learned and reinforce the skills. They rotate through self read, buddy read (carpet) listening, writing,  word work (reading skills). If the class is actively engaged thi

In [490]:

# Drop the separate essay columns in a single call
application_data.drop(columns=[f'project_essay_{i}' for i in range(1, 5)], inplace=True)

In [491]:
application_data.columns

Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays'],
      dtype='object')

In [493]:
application_data.shape

(182080, 13)

In [494]:
resource_data.head()

Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Box of 96, Yellow (13872)",2,13.59
4,p069063,"EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS (TRANQUIL BLUE), SET OF 4",3,24.95


**PROBLEM**: Merge the resources and application datasets on the *id* feature.

In [495]:
resource_data=resource_data.fillna(' ')
resource_data=resource_data.drop_duplicates()
resource_data['description']=resource_data.groupby(['id'])['description'].transform(lambda x : ' '.join(x)) 


In [496]:
# Merge two datasets


merged_df=application_data.merge(resource_data, on='id', how='left')
# Check the data to confirm it worked

merged_df.columns


Index(['id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category',
       'project_subject_categories', 'project_subject_subcategories',
       'project_title', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'essays', 'description', 'quantity', 'price'],
      dtype='object')

In [497]:
merged_df.shape

(1073254, 16)

In [498]:
resource_data.shape,resource_data.id.nunique()

((1528928, 4), 260115)

In [499]:
application_data.shape

(182080, 13)

In [500]:
application_data.id.nunique()

182080
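
The shapes above show that the merge returns 1,073,254 rows for 182,080 applications. This is because `transform` repeats the joined description on every resource row instead of collapsing them, so each application is matched once per remaining resource row. If one row per application is wanted, a possible approach is to aggregate the resources by `id` before merging. The sketch below uses invented rows, and the `total_cost` column is a hypothetical addition that is not part of the assignment.

```python
import pandas as pd

# Invented rows that mimic the shapes of the two tables.
apps = pd.DataFrame({"id": ["p1", "p2"], "essays": ["essay one", "essay two"]})
res = pd.DataFrame({
    "id": ["p1", "p1", "p2", "p3"],
    "description": ["pencils", "paper", "tablet", "unused"],
    "quantity": [2, 1, 3, 1],
    "price": [1.5, 4.0, 100.0, 9.99],
})

# Collapse resources to one row per project before merging.
res_per_project = (
    res.assign(total_cost=res["quantity"] * res["price"])
       .groupby("id", as_index=False)
       .agg(description=("description", " ".join), total_cost=("total_cost", "sum"))
)

# validate="one_to_one" raises MergeError if either side has duplicate ids.
merged = apps.merge(res_per_project, on="id", how="left", validate="one_to_one")
print(merged)
```

With this approach the merged frame keeps the same number of rows as the applications table, and the `validate` argument turns any accidental row duplication into an error.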

**PROBLEM**: Keep the following data for additional analysis (the id and the text features): `id`, `school_state`, `project_subject_categories`, `project_subject_subcategories`, `essays`, `description`

In [501]:
FEATURE_NAMES = ['school_state', 'project_subject_categories', 'project_subject_subcategories', 'essays', 'description']

In [502]:
# Keep the text features

merged_df_textual=merged_df[['id']+FEATURE_NAMES]

In [503]:
merged_df_textual.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy & Language,Literacy,"Most of my kindergarten students come from low-income households and are considered \""at-risk\"". These kids walk to school alongside their parents and most have never been further than walking distance from their house. For 80% of my students, English is not their first language or the language spoken at home. \r\n\r\nWhile my kindergarten kids have many obstacles in front of them, they come to school each day excited and ready to learn. Most students started the year out never being in a school setting. At the start of the year many had never been exposed to letters. Each day they soak up more knowledge and try their hardest to succeed. They are highly motivated to learn new things every day. We are halfway through the year and they are starting to take off. They know know all letters, some sight words, numbers to 20, and a majority of their letter sounds because of their hard work and determination. I am excited to see the places we will go from here!I currently have a differentiated sight word center that we do daily during our literacy stations. The students have activities that relate to whatever sight word list they are on. This is one of their favorite station activities. I want to continue to provide the students with engaging ways to practice their sight words. \r\n\r\nI dream of having the students use QR readers to scan the sight words that they are struggling with and the Ipods reading the sight words with them. This would help so many of my students by giving them multiple exposures to the words. My students need someone who can go over these sight words daily and I can't always get around to everyone to practice their flashcards with them. With the Ipods they would still have a way to practice their sight words on a daily basis.",Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Blue Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Silver
1,p036502,NV,Literacy & Language,Literacy,"Most of my kindergarten students come from low-income households and are considered \""at-risk\"". These kids walk to school alongside their parents and most have never been further than walking distance from their house. For 80% of my students, English is not their first language or the language spoken at home. \r\n\r\nWhile my kindergarten kids have many obstacles in front of them, they come to school each day excited and ready to learn. Most students started the year out never being in a school setting. At the start of the year many had never been exposed to letters. Each day they soak up more knowledge and try their hardest to succeed. They are highly motivated to learn new things every day. We are halfway through the year and they are starting to take off. They know know all letters, some sight words, numbers to 20, and a majority of their letter sounds because of their hard work and determination. I am excited to see the places we will go from here!I currently have a differentiated sight word center that we do daily during our literacy stations. The students have activities that relate to whatever sight word list they are on. This is one of their favorite station activities. I want to continue to provide the students with engaging ways to practice their sight words. \r\n\r\nI dream of having the students use QR readers to scan the sight words that they are struggling with and the Ipods reading the sight words with them. This would help so many of my students by giving them multiple exposures to the words. My students need someone who can go over these sight words daily and I can't always get around to everyone to practice their flashcards with them. With the Ipods they would still have a way to practice their sight words on a daily basis.",Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Blue Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Silver
2,p039565,GA,"Music & The Arts, Health & Sports","Performing Arts, Team Sports","Our elementary school is a culturally rich school, with a diverse population of 580 students, in Pre-K through sixth grade.\r\nOur Title I school population has 92% of students qualifying for free or reduced priced lunches and a high concentration of English Learners. We also serve two foster group homes for temporary and long-term placement of homeless children. \r\nWe do not see these statistics as road blocks. We see them as additions to our rich diversity. Together we will help students to develop to their fullest potential: Creative, problem-solving, compassionate adults.We strive to provide our diverse population of students with not only extra curricular activities, but an outlet for them to express themselves creatively.As a teacher, I have organized a dance club for lower elementary that meets once a week after school.\r\n\r\n This gives the girls something to look forward to, fosters the education of the whole child, and creates a social environment for our varied cultural student body. \r\n\r\nSince beginning our dance club, I have watched our girls who normally are introverts, bloom with excitement. They are also choreographing dances and productions in content areas during the school day.",Reebok Girls' Fashion Dance Graphic T-Shirt - Dd Dark Heather Grey - L
3,p233823,UT,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Hello;\r\nMy name is Mrs. Brotherton. I teach 5th grade at Ascent Academy in Utah. We are a wonderful charter school that uses the students' interests to help them learn. We are always looking for wonderful teaching methods to help students. \r\nMy students are wonderful. I have several levels in my class. Our school is big on curriculum compacting and helping gifted kids move faster and struggling kids get the extra help they need. We would benefit so much from your donation. Every little bit helps us reach our goal. I teach Science to all 88 5th grade students and have clusters of 25 new students every 8 weeks. Your donation will help more than just one 5th grade class it will help several students not to mention the students in years to come.\r\nWe are looking to add some 3Doodler to our classroom. It would be wonderful to have our own set of 3Doodler. In order to help our students achieve our mission. Our school is big on using technology to help students express what they have learned through a medium of their choice. Having these 3Doodler in our class my fast and advanced learners will be able to go ahead and start the project while I help the other students get their. The 3Doodler will also allow each one of my students find a medium to help them learn and retain the knowledge.,3doodler Start Full Edu Bundle
4,p185307,NC,Health & Sports,Health & Wellness,"My students are the greatest students but are socially and economically disadvantaged. We are an inner city school being limited to doing all activities in (PE) Physical Education inside because we have violence at the location where our school is located. All the physical activities the students are active with are within the school so we have to have a good program.\r\n\r\nMost of the students are either African American or Hispanic. The students range from being enrolled as a kindergarten-8th grade. Since Physical Education is important with one's success in school; all the students have PE class Monday through Friday every day. The proper equipment in PE is not always possible so this is why we are here asking for your help with shelving.\r\n\r\nThis project is \""kid-inspired\"" in that they want better fitness. They look back at their PE class and said that is only 30 minutes each day. They go to an extended day school year-round school which after 3:00 in the afternoon they have 30 minutes for more physical activity. They said they need more equipment to be active with. The students being \""kid inspired\"" want a variety of equipment for life changing physical activity.The student's project which is totally \""kid-inspired\""decided they needed more equipment to keep the them active and gaining better health. This \""kid-inspired\"", they realized they needed 60 minutes of physical activity in school, and they are going to make it happen. They get 30 minutes in Pe daily so they have to come up with 30 minutes more. They are not getting it when they go home because their extended day at school. The students said they need more equipment because they do not have what is needed in school presently. The variety of activity equipment,a large variety of balls, and a parachute is what the students are asking for. This equipment will increase peer-to-peer learning among students. 
This is very important building leaders within our school and community.\r\nIncreasing activities in all the students will help each student have better health with 60 minutes of daily physical activity. The physical activity is what all the students are talking about. Getting more!\r\nThe peer-to-peer learning building leaders and happiness within our school being everyone is active. This activity will only be possible as \""kid-inspired\"" with the equipment they are asking for.",BALL PG 4'' POLY SET OF 6 COLORS BALL PLAYGROUND POLY 8.5'' SET OF 6 KIT JUMBO GRADESTUFF PACK PARACHUTE GRIPSTARCHUTE 24 RECESS PACK GRADE K VIOLET


In [504]:
merged_df_textual=merged_df_textual.drop_duplicates()

In [505]:
merged_df_textual.shape

(182080, 6)

In [506]:
#merged_df_textual=merged_df_textual.sample(n=100000, random_state=1)
merged_df_textual.to_csv('merged_df_textual.csv',index=False)

# PART 3: Preprocess Text

Make an independent copy of the data so we can restart here when testing...

In [507]:
data = merged_df_textual.copy().fillna(' ')  # independent copy of the merged DataFrame; missing values become a single space

**PROBLEM**: Define a custom function `clean_punctuation()` to remove some punctuation from your text data. You don't have to do absolutely everything one might want to do - just show that you can do it. Start with some easy operations using `str.replace()`.

In [508]:
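# Define a custom function to clean punctuation from a given text

def clean_punctuation(txt):
    txt = txt.replace('&', ' ')       # ampersands, e.g. "Literacy & Language"
    txt = txt.replace('.', ' ')       # periods
    txt = txt.replace("\\r\\n", " ")  # the literal characters backslash-r-backslash-n, not a real line break
    return txt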
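
As a quick check, the function can be tried on a short string before it is applied to the whole column. The sample text below is made up for illustration and is not taken from the dataset. Note that `"\\r\\n"` matches the literal four-character sequence backslash-r-backslash-n, not an actual carriage return and newline.

```python
def clean_punctuation(txt):
    """Replace a few punctuation marks with spaces (same steps as above)."""
    txt = txt.replace('&', ' ')
    txt = txt.replace('.', ' ')
    txt = txt.replace("\\r\\n", " ")  # literal backslash-r-backslash-n text
    return txt

# Hypothetical sample string, written to resemble a raw essay
sample = "Literacy & Language.\\r\\nMy students are wonderful."
cleaned = clean_punctuation(sample)
print(repr(cleaned))
```

Each replaced character becomes a space, so runs of spaces can remain in the result; `str.split()` later treats them as a single separator.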

**PROBLEM**: Use the `apply()` function from pandas to _apply_ that function down the `essays` column of your data.

In [509]:
# Apply your function to clean the essays column
data['essays'] = data['essays'].apply(clean_punctuation)

data.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy & Language,Literacy,"Most of my kindergarten students come from low-income households and are considered \""at-risk\"" These kids walk to school alongside their parents and most have never been further than walking distance from their house For 80% of my students, English is not their first language or the language spoken at home While my kindergarten kids have many obstacles in front of them, they come to school each day excited and ready to learn Most students started the year out never being in a school setting At the start of the year many had never been exposed to letters Each day they soak up more knowledge and try their hardest to succeed They are highly motivated to learn new things every day We are halfway through the year and they are starting to take off They know know all letters, some sight words, numbers to 20, and a majority of their letter sounds because of their hard work and determination I am excited to see the places we will go from here!I currently have a differentiated sight word center that we do daily during our literacy stations The students have activities that relate to whatever sight word list they are on This is one of their favorite station activities I want to continue to provide the students with engaging ways to practice their sight words I dream of having the students use QR readers to scan the sight words that they are struggling with and the Ipods reading the sight words with them This would help so many of my students by giving them multiple exposures to the words My students need someone who can go over these sight words daily and I can't always get around to everyone to practice their flashcards with them With the Ipods they would still have a way to practice their sight words on a daily basis",Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Blue Apple - iPod nano� 16GB MP3 Player (8th Generation - Latest Model) - Silver
2,p039565,GA,"Music & The Arts, Health & Sports","Performing Arts, Team Sports","Our elementary school is a culturally rich school, with a diverse population of 580 students, in Pre-K through sixth grade Our Title I school population has 92% of students qualifying for free or reduced priced lunches and a high concentration of English Learners We also serve two foster group homes for temporary and long-term placement of homeless children We do not see these statistics as road blocks We see them as additions to our rich diversity Together we will help students to develop to their fullest potential: Creative, problem-solving, compassionate adults We strive to provide our diverse population of students with not only extra curricular activities, but an outlet for them to express themselves creatively As a teacher, I have organized a dance club for lower elementary that meets once a week after school This gives the girls something to look forward to, fosters the education of the whole child, and creates a social environment for our varied cultural student body Since beginning our dance club, I have watched our girls who normally are introverts, bloom with excitement They are also choreographing dances and productions in content areas during the school day",Reebok Girls' Fashion Dance Graphic T-Shirt - Dd Dark Heather Grey - L
3,p233823,UT,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Hello; My name is Mrs Brotherton I teach 5th grade at Ascent Academy in Utah We are a wonderful charter school that uses the students' interests to help them learn We are always looking for wonderful teaching methods to help students My students are wonderful I have several levels in my class Our school is big on curriculum compacting and helping gifted kids move faster and struggling kids get the extra help they need We would benefit so much from your donation Every little bit helps us reach our goal I teach Science to all 88 5th grade students and have clusters of 25 new students every 8 weeks Your donation will help more than just one 5th grade class it will help several students not to mention the students in years to come We are looking to add some 3Doodler to our classroom It would be wonderful to have our own set of 3Doodler In order to help our students achieve our mission Our school is big on using technology to help students express what they have learned through a medium of their choice Having these 3Doodler in our class my fast and advanced learners will be able to go ahead and start the project while I help the other students get their The 3Doodler will also allow each one of my students find a medium to help them learn and retain the knowledge,3doodler Start Full Edu Bundle
4,p185307,NC,Health & Sports,Health & Wellness,"My students are the greatest students but are socially and economically disadvantaged We are an inner city school being limited to doing all activities in (PE) Physical Education inside because we have violence at the location where our school is located All the physical activities the students are active with are within the school so we have to have a good program Most of the students are either African American or Hispanic The students range from being enrolled as a kindergarten-8th grade Since Physical Education is important with one's success in school; all the students have PE class Monday through Friday every day The proper equipment in PE is not always possible so this is why we are here asking for your help with shelving This project is \""kid-inspired\"" in that they want better fitness They look back at their PE class and said that is only 30 minutes each day They go to an extended day school year-round school which after 3:00 in the afternoon they have 30 minutes for more physical activity They said they need more equipment to be active with The students being \""kid inspired\"" want a variety of equipment for life changing physical activity The student's project which is totally \""kid-inspired\""decided they needed more equipment to keep the them active and gaining better health This \""kid-inspired\"", they realized they needed 60 minutes of physical activity in school, and they are going to make it happen They get 30 minutes in Pe daily so they have to come up with 30 minutes more They are not getting it when they go home because their extended day at school The students said they need more equipment because they do not have what is needed in school presently The variety of activity equipment,a large variety of balls, and a parachute is what the students are asking for This equipment will increase peer-to-peer learning among students This is very important building leaders within our school and 
community Increasing activities in all the students will help each student have better health with 60 minutes of daily physical activity The physical activity is what all the students are talking about Getting more! The peer-to-peer learning building leaders and happiness within our school being everyone is active This activity will only be possible as \""kid-inspired\"" with the equipment they are asking for",BALL PG 4'' POLY SET OF 6 COLORS BALL PLAYGROUND POLY 8.5'' SET OF 6 KIT JUMBO GRADESTUFF PACK PARACHUTE GRIPSTARCHUTE 24 RECESS PACK GRADE K VIOLET
9,p013780,CA,Health & Sports,Health & Wellness,"My students are athletes and students who are interested in health and physical activity In my elective class my students have a garden in which we grow our own food and make healthy meals within the kitchen My students love cooking from scratch and being creative with their meals Most of my students don't know how to cook from scratch, but they learn all of the basics in my classroom If you don't have health, nothing else seems to matter For some reason in our kitchen the water comes out from the faucet white looking and not clear like most water that comes from the faucet We are not exactly sure why that is, but the students are wary of using that water for their cooking or drinking After much online research we feel that the Berkey Water Filtration system would be an ideal solution to our water problem Although the water in Compton is probably fine to drink, it would be better for us to err on the side of caution and have our water filtered before we cook with it or drink it",Crown Berkey Water Filter With 2 Black and 2 PF2 Fluoride Filters


**PROBLEM**: Define **another** custom function called `clean_re()` to clean your text data using regular expressions. Do at least two "cleanings" (i.e., show that you can use the `re` library).

In [510]:
# Define a custom function to clean some given text
import re

def clean_re(txt):
    # Remove every character that is neither a word character nor whitespace
    p = re.compile(r'[^\w\s]')
    txt = p.sub('', txt)
    return txt
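
The prompt asks for at least two cleanings, while `clean_re()` above performs one (dropping every character that is neither a word character nor whitespace). A minimal sketch of a two-step variant follows. The name `clean_re_two_steps` and the second step (collapsing runs of whitespace) are assumptions added for illustration; the outputs shown in this notebook come from the one-step version.

```python
import re

# Compile once at module level so the patterns are reused across calls
NON_WORD = re.compile(r'[^\w\s]')   # anything that is not a word char or whitespace
MULTI_SPACE = re.compile(r'\s+')    # one or more whitespace characters

def clean_re_two_steps(txt):
    """Hypothetical variant of clean_re() with a second regex cleaning."""
    txt = NON_WORD.sub('', txt)              # cleaning 1: strip punctuation
    txt = MULTI_SPACE.sub(' ', txt).strip()  # cleaning 2: collapse whitespace
    return txt

print(clean_re_two_steps("Apple - iPod nano  (8th Generation)"))
```

Collapsing whitespace after stripping punctuation avoids the double spaces that otherwise remain where a character such as `-` was removed.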

In [511]:
# Apply clean_re() to all features
from tqdm import tqdm
for feature in tqdm(FEATURE_NAMES):
    data[feature] = data[feature].fillna('').astype(str).apply(clean_re)

data.head()


100%|██████████| 5/5 [00:10<00:00,  2.16s/it]


Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,NV,Literacy Language,Literacy,Most of my kindergarten students come from lowincome households and are considered atrisk These kids walk to school alongside their parents and most have never been further than walking distance from their house For 80 of my students English is not their first language or the language spoken at home While my kindergarten kids have many obstacles in front of them they come to school each day excited and ready to learn Most students started the year out never being in a school setting At the start of the year many had never been exposed to letters Each day they soak up more knowledge and try their hardest to succeed They are highly motivated to learn new things every day We are halfway through the year and they are starting to take off They know know all letters some sight words numbers to 20 and a majority of their letter sounds because of their hard work and determination I am excited to see the places we will go from hereI currently have a differentiated sight word center that we do daily during our literacy stations The students have activities that relate to whatever sight word list they are on This is one of their favorite station activities I want to continue to provide the students with engaging ways to practice their sight words I dream of having the students use QR readers to scan the sight words that they are struggling with and the Ipods reading the sight words with them This would help so many of my students by giving them multiple exposures to the words My students need someone who can go over these sight words daily and I cant always get around to everyone to practice their flashcards with them With the Ipods they would still have a way to practice their sight words on a daily basis,Apple iPod nano 16GB MP3 Player 8th Generation Latest Model Blue Apple iPod nano 16GB MP3 Player 8th Generation Latest Model Silver
2,p039565,GA,Music The Arts Health Sports,Performing Arts Team Sports,Our elementary school is a culturally rich school with a diverse population of 580 students in PreK through sixth grade Our Title I school population has 92 of students qualifying for free or reduced priced lunches and a high concentration of English Learners We also serve two foster group homes for temporary and longterm placement of homeless children We do not see these statistics as road blocks We see them as additions to our rich diversity Together we will help students to develop to their fullest potential Creative problemsolving compassionate adults We strive to provide our diverse population of students with not only extra curricular activities but an outlet for them to express themselves creatively As a teacher I have organized a dance club for lower elementary that meets once a week after school This gives the girls something to look forward to fosters the education of the whole child and creates a social environment for our varied cultural student body Since beginning our dance club I have watched our girls who normally are introverts bloom with excitement They are also choreographing dances and productions in content areas during the school day,Reebok Girls Fashion Dance Graphic TShirt Dd Dark Heather Grey L
3,p233823,UT,Math Science Literacy Language,Applied Sciences Literature Writing,Hello My name is Mrs Brotherton I teach 5th grade at Ascent Academy in Utah We are a wonderful charter school that uses the students interests to help them learn We are always looking for wonderful teaching methods to help students My students are wonderful I have several levels in my class Our school is big on curriculum compacting and helping gifted kids move faster and struggling kids get the extra help they need We would benefit so much from your donation Every little bit helps us reach our goal I teach Science to all 88 5th grade students and have clusters of 25 new students every 8 weeks Your donation will help more than just one 5th grade class it will help several students not to mention the students in years to come We are looking to add some 3Doodler to our classroom It would be wonderful to have our own set of 3Doodler In order to help our students achieve our mission Our school is big on using technology to help students express what they have learned through a medium of their choice Having these 3Doodler in our class my fast and advanced learners will be able to go ahead and start the project while I help the other students get their The 3Doodler will also allow each one of my students find a medium to help them learn and retain the knowledge,3doodler Start Full Edu Bundle
4,p185307,NC,Health Sports,Health Wellness,My students are the greatest students but are socially and economically disadvantaged We are an inner city school being limited to doing all activities in PE Physical Education inside because we have violence at the location where our school is located All the physical activities the students are active with are within the school so we have to have a good program Most of the students are either African American or Hispanic The students range from being enrolled as a kindergarten8th grade Since Physical Education is important with ones success in school all the students have PE class Monday through Friday every day The proper equipment in PE is not always possible so this is why we are here asking for your help with shelving This project is kidinspired in that they want better fitness They look back at their PE class and said that is only 30 minutes each day They go to an extended day school yearround school which after 300 in the afternoon they have 30 minutes for more physical activity They said they need more equipment to be active with The students being kid inspired want a variety of equipment for life changing physical activity The students project which is totally kidinspireddecided they needed more equipment to keep the them active and gaining better health This kidinspired they realized they needed 60 minutes of physical activity in school and they are going to make it happen They get 30 minutes in Pe daily so they have to come up with 30 minutes more They are not getting it when they go home because their extended day at school The students said they need more equipment because they do not have what is needed in school presently The variety of activity equipmenta large variety of balls and a parachute is what the students are asking for This equipment will increase peertopeer learning among students This is very important building leaders within our school and community Increasing activities in all the students 
will help each student have better health with 60 minutes of daily physical activity The physical activity is what all the students are talking about Getting more The peertopeer learning building leaders and happiness within our school being everyone is active This activity will only be possible as kidinspired with the equipment they are asking for,BALL PG 4 POLY SET OF 6 COLORS BALL PLAYGROUND POLY 85 SET OF 6 KIT JUMBO GRADESTUFF PACK PARACHUTE GRIPSTARCHUTE 24 RECESS PACK GRADE K VIOLET
9,p013780,CA,Health Sports,Health Wellness,My students are athletes and students who are interested in health and physical activity In my elective class my students have a garden in which we grow our own food and make healthy meals within the kitchen My students love cooking from scratch and being creative with their meals Most of my students dont know how to cook from scratch but they learn all of the basics in my classroom If you dont have health nothing else seems to matter For some reason in our kitchen the water comes out from the faucet white looking and not clear like most water that comes from the faucet We are not exactly sure why that is but the students are wary of using that water for their cooking or drinking After much online research we feel that the Berkey Water Filtration system would be an ideal solution to our water problem Although the water in Compton is probably fine to drink it would be better for us to err on the side of caution and have our water filtered before we cook with it or drink it,Crown Berkey Water Filter With 2 Black and 2 PF2 Fluoride Filters


In [512]:
data.to_csv('clean_re.csv', index=False)    

data.shape

(182080, 6)

In [513]:
data=pd.read_csv('clean_re.csv',lineterminator='\n')

In [514]:
data.shape


(182080, 6)

In [515]:
data['description'].head(10)

0    Apple  iPod nano 16GB MP3 Player 8th Generation  Latest Model  Blue Apple  iPod nano 16GB MP3 Player 8th Generation  Latest Model  Silver                                           
1    Reebok Girls Fashion Dance Graphic TShirt  Dd Dark Heather Grey  L                                                                                                                  
2    3doodler Start Full Edu Bundle                                                                                                                                                      
3    BALL PG 4 POLY SET OF 6 COLORS BALL PLAYGROUND POLY 85 SET OF 6 KIT JUMBO GRADESTUFF PACK PARACHUTE GRIPSTARCHUTE 24 RECESS PACK GRADE K VIOLET                                     
4    Crown Berkey Water Filter With 2 Black and 2 PF2 Fluoride Filters                                                                                                                   
5    Amazon  Fire Kids Edition  7 Tablet  16GB  Green Amazon  Fire Kid

In [516]:
data.drop_duplicates().shape

(182080, 6)

**PROBLEM**: Remove stopwords. (Hint: use the stopword list from nltk's `stopwords.words('english')` plus any additions you'd like to make. Then, again, define a custom function and apply it to all features.)

In [517]:
stopWords.extend(['although','engaging','approximately','yet','nan','u','us','would','see','big','student','school','many'])


In [518]:
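# Define custom function to remove stopwords
df = copy.copy(data)
df = df.drop_duplicates()

# Build a set once: membership tests on a set are much faster than on a list
stopword_set = set(stopWords)

def clean_stopword(txt):
    # Lowercase, split on whitespace, and keep only words that are not stopwords
    words = txt.lower().split()
    filtered_sentence = [w for w in words if w not in stopword_set]
    return ' '.join(filtered_sentence)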
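
To see what stopword removal does to a single string, here is a self-contained sketch. The short `demo_stopwords` set is a stand-in so the snippet runs without downloading the nltk corpus, and `remove_stopwords` is a hypothetical helper name; the notebook itself works with the full `stopWords` list. A set is used here because membership tests on a set take roughly constant time, whereas `w not in some_list` scans the list for every word.

```python
# Stand-in stopword set for illustration only (assumption: the real list
# comes from nltk plus the custom additions above)
demo_stopwords = {'my', 'are', 'the', 'and', 'of', 'would', 'us', 'school'}

def remove_stopwords(txt, stopword_set):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    words = txt.lower().split()
    kept = [w for w in words if w not in stopword_set]
    return ' '.join(kept)

print(remove_stopwords("My students are the greatest students", demo_stopwords))
```

Because the text is lowercased before the lookup, the stopword entries themselves need to be lowercase for the comparison to match.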

In [519]:
# Apply function to remove stopwords
from tqdm import tqdm
for feature in tqdm(FEATURE_NAMES):
    df[feature] = df[feature].fillna(' ').apply(clean_stopword)

df.to_csv('clean_stopword.csv', index=False)
df.head(10)



100%|██████████| 5/5 [1:27:06<00:00, 1045.36s/it]


Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,nv,literacy language,literacy,kindergarten students come lowincome households considered atrisk kids walk alongside parents never walking distance house 80 students english first language language spoken home kindergarten kids obstacles front come day excited ready learn students started year never setting start year never exposed letters day soak knowledge try hardest succeed highly motivated learn new things every day halfway year starting take know know letters sight words numbers 20 majority letter sounds hard work determination excited places go herei currently differentiated sight word center daily literacy stations students activities relate whatever sight word list one favorite station activities want continue provide students ways practice sight words dream students qr readers scan sight words struggling ipods reading sight words help students giving multiple exposures words students need someone go sight words daily cant always get around everyone practice flashcards ipods still way practice sight words daily basis,apple ipod nano 16gb mp3 player 8th generation latest model blue apple ipod nano 16gb mp3 player 8th generation latest model silver
1,p039565,ga,music arts health sports,performing arts team sports,elementary culturally rich diverse population 580 students prek sixth grade title population 92 students qualifying free reduced priced lunches high concentration english learners also serve two foster group homes temporary longterm placement homeless children statistics road blocks additions rich diversity together help students develop fullest potential creative problemsolving compassionate adults strive provide diverse population students extra curricular activities outlet express creatively teacher organized dance club lower elementary meets week gives girls something look forward fosters education whole child creates social environment varied cultural body since beginning dance club watched girls normally introverts bloom excitement also choreographing dances productions content areas day,reebok girls fashion dance graphic tshirt dd dark heather grey l
2,p233823,ut,math science literacy language,applied sciences literature writing,hello name mrs brotherton teach 5th grade ascent academy utah wonderful charter uses students interests help learn always looking wonderful teaching methods help students students wonderful several levels class curriculum compacting helping gifted kids move faster struggling kids get extra help need benefit much donation every little bit helps reach goal teach science 88 5th grade students clusters 25 new students every 8 weeks donation help one 5th grade class help several students mention students years come looking add 3doodler classroom wonderful set 3doodler order help students achieve mission using technology help students express learned medium choice 3doodler class fast advanced learners able go ahead start project help students get 3doodler also allow one students find medium help learn retain knowledge,3doodler start full bundle
3,p185307,nc,health sports,health wellness,students greatest students socially economically disadvantaged inner city limited activities pe physical education inside violence location located physical activities students active within good program students either african american hispanic students range enrolled kindergarten8th grade since physical education important ones success students pe class monday friday every day proper equipment pe always possible asking help shelving project kidinspired want better fitness look back pe class said 30 minutes day go extended day yearround 300 afternoon 30 minutes physical activity said need equipment active students kid inspired want variety equipment life changing physical activity students project totally kidinspireddecided needed equipment keep active gaining better health kidinspired realized needed 60 minutes physical activity going make happen get 30 minutes pe daily come 30 minutes getting go home extended day students said need equipment needed presently variety activity equipmenta large variety balls parachute students asking equipment increase peertopeer learning among students important building leaders within community increasing activities students help better health 60 minutes daily physical activity physical activity students talking getting peertopeer learning building leaders happiness within everyone active activity possible kidinspired equipment asking,ball pg 4 poly set 6 colors ball playground poly 85 set 6 kit jumbo gradestuff pack parachute gripstarchute 24 recess pack grade k violet
4,p013780,ca,health sports,health wellness,students athletes students interested health physical activity elective class students garden grow food make healthy meals within kitchen students love cooking scratch creative meals students dont know cook scratch learn basics classroom dont health nothing else seems matter reason kitchen water comes faucet white looking clear like water comes faucet exactly sure students wary using water cooking drinking much online research feel berkey water filtration system ideal solution water problem water compton probably fine drink better err side caution water filtered cook drink,crown berkey water filter 2 black 2 pf2 fluoride filters
5,p063374,de,applied learning literacy language,character education literature writing,kids tell day want make one happy teacher respectful ontask accountable responsible safe schools core values first graders amaze determination improvement academically watched make half years worth learning two months babies defying odds way closing achievement gap need support started program called telementoring hopes students receive virtual mentor response awesome adults providing much needed encouragement support motivation students need students writing improved tremendously also progressing reading aided online interaction mentors need tablets set mentoring station allow students participate daily expedite academic social growth currently three desktop computers working used access math software telementors please help connect,amazon fire kids edition 7 tablet 16gb green amazon fire kids edition 7tablet 16gb blue
6,p103285,mo,health sports,health wellness,kindergarten new first grade students held high expectations kindergarten include learning read addition subtraction among things still young biggest challenges attention spans sitting still take dance breaks throughout day always need sort outlet excess energy students love superheroes minecraft barbies recess course opportunities day play never seems enough kiddos huge imaginations tons creativity boot students classroom come diverse socioeconomic cultural backgrounds special needs students classroom well highly transient population balance discs stools flexible seating students choice classroom help prepare real world hope create selfregulated learners giving students choice able hopefully make safe choice also meet sensory needs students able learn grow environment suited needs age need active learn better outlet fidgets focus improve learning get done donations hope help students attention spans result higher achievement also create opportunities active engagement classroom want motivate students want come learn giving students choice also improve community within classroom allow better relationship building,hokki stool 15 black therapists choice inflated airfilled stability balance disc
7,p181781,sc,applied learning literacy language,early development literature writing,first graders fantastic excited learn grow day students wonders curious world around students 6 7yearolds high poverty elementary south carolina sitting still something love students need active learning environment thankful donation give help reach goal thank youfirst graders love learning need 6 wigglestools allow students wiggle learn keep brains bodies active educator ive spent time researching best practices students one common strategies ive seen keeping students active said stools go good first graders learn well small group scenarios chance move around learning reading writing math science plan stools smallgroup learning table teaching reading writing math thank generosity,ecr4kids ace active sitting stool 15 black
8,p114989,,math science,mathematics,seventh graders dream cant wait go college dream college careers healthcare engineering law get goals going work pretty hard overcome obstacles almost students qualify free reduced lunch 65 students perform grade level standardized tests 30 learning english middle 25 learning disability impacts academic performance despite students work hard every single day try close gaps used alternative seating classroom years help active students seen huge growth engagement translates increase grades test scores want reach students including students need move learn used exercise ball chairs chairs wheels trying wobble stools cant wait classroom durable fun allow students move around without distracting students around students love choice sit thinking type seat helps learn helps develop self awareness better advocates education,kore patented wobble chair made usa active sitting toddler preschool kids teens kids dont sit still anymore best seat classroom black teen 18in
9,p191410,il,literacy language,literacy,teach first grade small farming town illinois diverse students population one side town students come halfmilliondollar homes side town students come publichousing situations 30 population receives free reduced price lunch fees goal students give every one opportunity learn every advantage possible accomplish one thing short time students know truly love value individuals nothing better snuggling good book getting lost strive create love reading motivating students want read past several years adding furniture couch pillows tent students snuggle read book love add 2 reclining chairs students cozy reading currently students comfortable space reading adding two recliners students able space reading time seen first hand students develop love reading able get comfortable students beg extend silent reading time,ace bayou 6498501 juvenile recliner urban vinyl black


In [520]:
df=pd.read_csv('clean_stopword.csv',lineterminator='\n')

In [521]:
df[-10:]

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
182070,p068185,ok,literacy language math science,literature writing mathematics,hamilton elementary title 1 recieves 100 free breakfast lunch students things life working want make sure teacher students afforded opportunities educational equity students access upper grade classrooms switching self paced learning platform called summit program allows students set goals every week oneonone mentor work small group settings recieve individual attention regular gen ed classroom gives students power choose time place pace path want make sure teacher students afforded opportunities educational equity students access resources used classroom students freedom choose place work comfortable working current desks plastic hard chairs cracking edges definitely ideal 9 hour extended day new comfortable seating arrangements students able work comfort group projects individualized learning small teacherled groups please support hamilton husky family make dream kiddos reality anything friend donate help greatly,joe dorm chair limo black inflated stability wobble cushion including free pump exercise fitness core balance discbluesize 13 inches 33 cm diameter norwood commercial furniture nor1101acso plastic stack stools 1775 height 1175 width 1175 length assorted pack 5 peace yoga zafu meditation yoga buckwheat filled round cotton bolster pillow cushion blue 16 x 16
182071,p248714,tn,literacy language math science,literacy mathematics,first grade students come everyday excited ready learn students classroom come variety backgrounds fortunate enough travel world others fortunate regardless students background administrators teachers parents volunteers whatever takes make sure children needs met known community district going beyond expectations meet needs students walked students reading books playing instruments acting dancing drawing beautiful pictures schools mission create culturally diverse tradition excellence students encouraged excel academically learning skills necessary responsible confident lifelong learners ad productive members everchanging society order make classroom inviting environment give students space need focus learning must rug classroom need rug morning meeting whole group lessons calendar time rug classroom allow students stay focused learning common place meet share ideas new learning storage drawers post charts markers post notes educational books enhance learning literacy math stations small group lessons,number sense routines building numerical literacy every day grades k3 ottomanson jenny collection blue base multi colors kids childrens educational solar system design area classroom rugs 710 x 910 dark red postit selfstick easel pad 25 x 305 inches 30sheet pad 2 pack postit super sticky notes 4 x 4 rio de janeiro collection lined 6 padspack 6756ssuc sharpie flip chart markers assorted colors box 8 22478 sterilite clearview 3 storage drawer organizer teaching common core reading standards literature 2
182072,p045565,tx,applied learning literacy language,early development literature writing,blessed work rural quickly growing town north houston texas kinderbabies come variety backgrounds living lakeside fancy houses sharing beds siblings old trailers title 1 campus sets high expectations students academically personally desire give experiences opportunities succeed want push achieve goals develop selfmotivated learners fun learning goal teach students independent learning individuals items allow teach independence writing organization promoting writing process developing kinderbabies confident students students able get materials quickly easily minimizing time spend passing allowing spend time working students oneonone developing skills kindergarten students confident set successful years,dg273 heavyduty shelves cubbies storage unit jj926 classic birch tabletop writing center lm113 heavyduty bins set 3 ra414 tabletop paper center rr269 lakeshore paper storage center
182073,p078709,tx,health sports,team sports,group inspiring students make impact world high achievers classroom limited resources students grow vegetables maintain lush garden walks habitat provide students opportunities learn students range kindergarten 5th grade eagerness learn unmatched want challenged want challenge greatly benefit support students brightest district academic achievements astounding willing challenged able rise occasion materials requested enhance mental physical learning research shows athletes absent 50 less often 11 higher graduation rate four times likely attend college despite powerful impact difficult find funds run successful sports programs trying build athletics program campus unfortunately limited resources support students achieve even greater heights,champion equipment cart one color one size coachdeck instructional basketball drill cards one color one size dicks sporting goods squeeze bottles carrier one color one size krazy netz 12 loop basketball net royal one size lifetime basketball scorebook one color one size sklz court vision dribbling goggles one color one size sklz dman defensive mannequin one color one size sklz double double shooting rebounding basketball trainer one color one size sklz heavy weight control training basketball 295 one color one size sklz lightweight control training basketball 225 one color one size spalding tf1000 classic basketball 285 one color one size spalding tf1000 classic official basketball 295 one color one size
182074,p184627,ca,literacy language special needs,literacy special needs,students class sixth seventh eighth graders attend public middle large urban district students come culturally linguistically diverse backgrounds majority students low income households students class identified special learning needs auditory processing deficits attention deficits make difficult find success traditional classroom setting asking forty new young adult novels books bestsellers recommended reads right students dont access books budget cuts closed library getting safely public library isnt possible students students reluctant readers found new books books friends talking books want read overcome reluctance takes one book overcome reluctance read want make sure book shelf ready pick please help inspiring students building lifelong readers,mangoshaped space new class star wars jedi academy 4 nate blasts nate flips nate flips nate goes broke nate lives nate revenge cream puffs charlie bumpers vs teacher year diamond willow dog man creator captain underpants dog man 1 double diary wimpy kid 11 el deafo evil spy gone fishing novel verse harry potter cursed child parts one two special rehearsal edition script official script book original west end production harry potter sorcerers stone hunt bamboo rat youre reading inspector flytrap book 1 inspector flytrap presidents mane missing book 2 looking alaska loot lucy andy neanderthal lucy andy neanderthal march book three march book two mountain dog pi sky rabbit robot sleepover candlewick sparks spaced moon base alpha spy ski spy candymakers candymakers great chocolate chase forbidden library great wall lucy wu last kids earth zombie parade outsiders timmy failure sanitized protection timmy failure book youre supposed zane hurricane story katrina
182075,p014188,nm,math science,mathematics,currently teach math lowincome students dont electricity running water difficulties face challenges head students also handson learners need different ways practice math skills taught math starting get difficult makes imperative find way make math practice fun philosophy everyone learn math presented right way students seem get board math quickly practice needed master skill dull ordering books games able continue get better math fun reproducible books keep students engaged students need practice skills master concepts books students fun games create fun atmosphere students play learn time students longer dread math look forward,190 readytouse activities make math fun math dash multiplicationdivision math puzzles brainteasers grades 68 middle math game kit middle math design prealgebra design book
182076,p116452,az,music arts,performing arts,students predominantly hispanic often little exposure performing arts goal instill love theatre within children help develop terms confidence amazing talents like voices heard literally metaphorically students courteous committed helpful safe feel also deserving please show students society values arts taken back earlier year brought personal sewing machine sew centipede costume james giant peach boys typically like building tech stage production really interested sewing real men sew able get buy one sewing machine class one sewing machine group 20 plus kids like telling share 1 pencil order assignment one students last year said ever since taught sew hemmed pants family members home sewing practical skill love provide students access serger perfectionists,brother designio series dz1234 serger
182077,p074761,az,math science,applied sciences environmental science,teach science eighth graders suburban arizona students crave knowledge experiences make science come alive bright talented fun students taking physics chemistry 8th grade thats amazing accomplishment choice class hunger knowledge knows bounds students scientists mathematicians engineers computer geeks tomorrow ready take world need fundamentals steam project students able make schoolspirited environmentallyfriendly products sell spirit store items designed manufactured marketed students making true problembased learning project students tools provided donors choose grant learn teach others sustainability learn environmental responsibility part steam program students opportunity make tee shirts drink mugs vinyl labels buttons choosing environmentallyfriendly sayings graphics using environmentally responsible materials students challenges successes sustainability teach others,100 buttons cover made usa cover buttons wire eye backs size 60 1 12 button maker badge making machine button maker machine 1 inch3 inch 25mm 75mm aluminum round mold base craft e vinyl 12 x 12 40 sheets assorted glossy colors permanent adhesive backed vinyl cricut cutters craftrobo cutters pazzles cutters quickutz cutters cev1200 cricut explore air 2 mint machine bundle heat transfer exclusive designs cricut german carbide premium blade cricut tools scoring stylus dawei 200pcs 2 50mm topmetalbottom cover clip pin blank badge button parts badge maker machine neil enterprises pin back button parts machine 225 inch pack 500 permanent self adhesive backed vinyl sheets cricut silhouette cameo crafting craft cutters 12x12 38 sheets assorted colors
182078,p136737,fl,literacy language,literacy,work group wonderful second grade students bright eager please ready learn however title means 90 breakfasts lunches served free lowincome area students come variety backgrounds vastly different learning levels goal create learning environment satisfies students needs give every opportunity best know succeed ever went library find good book ever went library find good book found disaster current problem library inadequate storage leaves library mess try might books space wonderful horrible problem students struggle find books like 3 drawer organizers label drawer according level topic students longer dig books find story like reading level become discouraged read wonderful book find outside reading level cannot take r test shelves make difference helping hold young readers attention quickly easily able find looking,iris 3drawer storage cart organizer top white
182079,p190772,tx,literacy language,literacy,balanced literacy mystery couple years ago district began implementing program typical morning classroom involves wholegroup lesson stations assignments smallgroup teaching based students reading levels beginning year reading levels students range preschool 4th 5th grade offers programs incentives keep kids reading year long strive help reach reading potential fact want become leaders reading listening station check audio books meet needs students whether struggling advanced readers every benefit individual level fluency comprehension greatly impacted resource better way encourage reader leadersmy students able listen books read aloud cd participate literacy group book club activities go along books read students complete journal writing word study creative extension activities part balanced literacy using convenient listening station donations project help students become better readers writers speakers able reach reading goals conquer state test hope also become lifetime readers learners leaders,ce764 readytogo listening center 4 re209x leveled ln readalongs


In [522]:
df.shape

(182080, 6)

**PROBLEM**: Now use Gensim’s `simple_preprocess()` function to tokenize and clean up your text data. TIP: `simple_preprocess()` returns a list of words, so we want to wrap it with a function that joins the list back together into a string.

In [523]:
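# Define a custom function that wraps simple_preprocess() from gensim
from gensim.utils import simple_preprocess

df2 = df.copy()  # work on a copy so df stays unchanged

def simple_preprocess_custom(txt):
    # simple_preprocess() returns a list of tokens; join them back into one string
    tokens = simple_preprocess(txt, deacc=True)
    return ' '.join(tokens)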

In [524]:
# Apply simple_preprocess() to all features

for feature in tqdm(FEATURE_NAMES):
    df2[feature]=df2[feature].fillna('').apply(simple_preprocess_custom)

100%|██████████| 5/5 [36:12<00:00, 434.54s/it]


In [525]:
df2.head()
df2.to_csv('simple_preprocess_custom.csv', index=False)       


In [526]:
df2=pd.read_csv('simple_preprocess_custom.csv',lineterminator='\n')


In [527]:
df2.head()

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,nv,literacy language,literacy,kindergarten students come lowincome households considered atrisk kids walk alongside parents never walking distance house students english first language language spoken home kindergarten kids obstacles front come day excited ready learn students started year never setting start year never exposed letters day soak knowledge try hardest succeed highly motivated learn new things every day halfway year starting take know know letters sight words numbers majority letter sounds hard work determination excited places go herei currently differentiated sight word center daily literacy stations students activities relate whatever sight word list one favorite station activities want continue provide students ways practice sight words dream students qr readers scan sight words struggling ipods reading sight words help students giving multiple exposures words students need someone go sight words daily cant always get around everyone practice flashcards ipods still way practice sight words daily basis,apple ipod nano gb mp player th generation latest model blue apple ipod nano gb mp player th generation latest model silver
1,p039565,ga,music arts health sports,performing arts team sports,elementary culturally rich diverse population students prek sixth grade title population students qualifying free reduced priced lunches high concentration english learners also serve two foster group homes temporary longterm placement homeless children statistics road blocks additions rich diversity together help students develop fullest potential creative problemsolving compassionate adults strive provide diverse population students extra curricular activities outlet express creatively teacher organized dance club lower elementary meets week gives girls something look forward fosters education whole child creates social environment varied cultural body since beginning dance club watched girls normally introverts bloom excitement also choreographing dances productions content areas day,reebok girls fashion dance graphic tshirt dd dark heather grey
2,p233823,ut,math science literacy language,applied sciences literature writing,hello name mrs brotherton teach th grade ascent academy utah wonderful charter uses students interests help learn always looking wonderful teaching methods help students students wonderful several levels class curriculum compacting helping gifted kids move faster struggling kids get extra help need benefit much donation every little bit helps reach goal teach science th grade students clusters new students every weeks donation help one th grade class help several students mention students years come looking add doodler classroom wonderful set doodler order help students achieve mission using technology help students express learned medium choice doodler class fast advanced learners able go ahead start project help students get doodler also allow one students find medium help learn retain knowledge,doodler start full bundle
3,p185307,nc,health sports,health wellness,students greatest students socially economically disadvantaged inner city limited activities pe physical education inside violence location located physical activities students active within good program students either african american hispanic students range enrolled kindergarten th grade since physical education important ones success students pe class monday friday every day proper equipment pe always possible asking help shelving project kidinspired want better fitness look back pe class said minutes day go extended day yearround afternoon minutes physical activity said need equipment active students kid inspired want variety equipment life changing physical activity students project totally needed equipment keep active gaining better health kidinspired realized needed minutes physical activity going make happen get minutes pe daily come minutes getting go home extended day students said need equipment needed presently variety activity equipmenta large variety balls parachute students asking equipment increase peertopeer learning among students important building leaders within community increasing activities students help better health minutes daily physical activity physical activity students talking getting peertopeer learning building leaders happiness within everyone active activity possible kidinspired equipment asking,ball pg poly set colors ball playground poly set kit jumbo gradestuff pack parachute gripstarchute recess pack grade violet
4,p013780,ca,health sports,health wellness,students athletes students interested health physical activity elective class students garden grow food make healthy meals within kitchen students love cooking scratch creative meals students dont know cook scratch learn basics classroom dont health nothing else seems matter reason kitchen water comes faucet white looking clear like water comes faucet exactly sure students wary using water cooking drinking much online research feel berkey water filtration system ideal solution water problem water compton probably fine drink better err side caution water filtered cook drink,crown berkey water filter black pf fluoride filters


In [528]:
df2.shape

(182080, 6)
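
Comparing the rows above with the earlier output shows that digits have disappeared ("5th" became "th", "3doodler" became "doodler"). The sketch below is a rough standard-library approximation of that behavior, written only for illustration; it is not gensim's implementation, and the names `approx_simple_preprocess` and `approx_preprocess_to_string` are hypothetical. It assumes the defaults of keeping tokens between 2 and 15 characters long.

```python
import re

# Rough stand-in for gensim.utils.simple_preprocess (an approximation for
# illustration only): keep runs of non-digit word characters, lowercase
# them, and drop tokens that are very short or very long.
TOKEN_RE = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]


def approx_preprocess_to_string(text):
    # Join the token list back into one string, as the wrapper above does.
    return " ".join(approx_simple_preprocess(text))


print(approx_preprocess_to_string("teach 5th grade 3doodler start full bundle"))
```

Because digits split tokens, product names and grade levels lose information at this step, which may matter when interpreting the `description` feature later.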

**PROBLEM**: Lemmatize the text. (Hint: Define a custom function and then apply it to all features.)

In [529]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ekaterinakryukova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [530]:
importlib.reload(nltk)
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from collections import Counter

In [531]:
df3=copy.copy(df2)



In [532]:
df3.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,nv,literacy language,literacy,kindergarten students come lowincome households considered atrisk kids walk alongside parents never walking distance house students english first language language spoken home kindergarten kids obstacles front come day excited ready learn students started year never setting start year never exposed letters day soak knowledge try hardest succeed highly motivated learn new things every day halfway year starting take know know letters sight words numbers majority letter sounds hard work determination excited places go herei currently differentiated sight word center daily literacy stations students activities relate whatever sight word list one favorite station activities want continue provide students ways practice sight words dream students qr readers scan sight words struggling ipods reading sight words help students giving multiple exposures words students need someone go sight words daily cant always get around everyone practice flashcards ipods still way practice sight words daily basis,apple ipod nano gb mp player th generation latest model blue apple ipod nano gb mp player th generation latest model silver


In [533]:
# Write a lemmatization function based on nltk.stem.WordNetLemmatizer() (an earlier attempt did not finish overnight)


In [534]:
def get_pos(word):
    # Guess the part of speech as the tag with the most WordNet synsets.
    w_synsets = wordnet.synsets(word)

    pos_counts = Counter()
    # Only noun, verb, adjective and adverb are counted;
    # satellite adjectives (pos "s") are ignored.
    for pos in ("n", "v", "a", "r"):
        pos_counts[pos] = len([item for item in w_synsets if item.pos() == pos])

    # Ties, including words unknown to WordNet, resolve to the first tag above ("n").
    return pos_counts.most_common(1)[0][0]

In [535]:

wnl = WordNetLemmatizer()
w_tokenizer = WhitespaceTokenizer()
def lemmatize_text(text):
    
    #lemmatize = lru_cache(maxsize=50000)(wnl.lemmatize)
    text=[wnl.lemmatize(w,get_pos(w)) for w in w_tokenizer.tokenize(text)]
    return text
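
The commented-out `lru_cache` line hints at one way to speed this step up: essays repeat the same words many times, so each distinct word only needs to be looked up once. The following is a minimal sketch of that idea, not the notebook's method. `slow_lemma` is a hypothetical stand-in for `wnl.lemmatize(w, get_pos(w))` so that the example runs without WordNet data, and the call counter exists only to make the effect of the cache visible.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the expensive function actually runs


def slow_lemma(word):
    # Hypothetical stand-in for wnl.lemmatize(word, get_pos(word)).
    calls["n"] += 1
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


@lru_cache(maxsize=50000)
def cached_lemma(word):
    # Repeated words are answered from the cache instead of being recomputed.
    return slow_lemma(word)


def lemmatize_text_cached(text):
    return [cached_lemma(w) for w in text.split()]


tokens = lemmatize_text_cached("students love books students read books")
print(tokens, calls["n"])
```

In the real pipeline the same decorator could wrap a small function that calls `get_pos()` and `wnl.lemmatize()`, assuming the vocabulary is much smaller than the total token count.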

In [536]:
# Apply lemmatize_text() to the text features.
# Lemmatization is slow on the full data, so only 'essays' and 'description' are processed.
from tqdm import tqdm
from tqdm.notebook import tqdm_notebook

tqdm_notebook.pandas()  # register progress_apply() on pandas objects once

for feature in tqdm(['essays', 'description']):
    df3[feature] = df3[feature].fillna(' ').progress_apply(lemmatize_text)


 50%|█████     | 1/2 [6:45:45<6:45:45, 24345.58s/it]
100%|██████████| 2/2 [6:50:34<00:00, 12317.49s/it]

In [537]:
df3.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,nv,literacy language,literacy,"[kindergarten, student, come, lowincome, household, consider, atrisk, kid, walk, alongside, parent, never, walk, distance, house, student, english, first, language, language, speak, home, kindergarten, kid, obstacle, front, come, day, excite, ready, learn, student, start, year, never, set, start, year, never, expose, letter, day, soak, knowledge, try, hard, succeed, highly, motivate, learn, new, thing, every, day, halfway, year, start, take, know, know, letter, sight, word, number, majority, letter, sound, hard, work, determination, excite, place, go, herei, currently, differentiate, sight, word, center, daily, literacy, station, student, activity, relate, whatever, sight, word, list, one, favorite, station, activity, want, continue, provide, student, way, practice, sight, ...]","[apple, ipod, nano, gb, mp, player, th, generation, late, model, blue, apple, ipod, nano, gb, mp, player, th, generation, late, model, silver]"


In [539]:
df33=copy.copy(df3)

In [540]:
stopWords.extend(['learn', 'classroom','help','need','work','th','come','class','love','able','year','time','want','make'])


In [541]:
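# After lemmatization, stopwords must be removed again,
# because lemmatized forms may now match entries in the stopword list
from tqdm import tqdm

stop_set = set(stopWords)  # set membership tests are much faster than list lookups
for feature in tqdm(['essays', 'description']):
    df33[feature] = df33[feature].fillna(' ').apply(lambda w: [i for i in w if i not in stop_set])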

100%|██████████| 2/2 [01:35<00:00, 47.68s/it]


In [542]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in newer pandas
df33.head(1)

Unnamed: 0,id,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,p036502,nv,literacy language,literacy,"[kindergarten, lowincome, household, consider, atrisk, kid, walk, alongside, parent, never, walk, distance, house, english, first, language, language, speak, home, kindergarten, kid, obstacle, front, day, excite, ready, start, never, set, start, never, expose, letter, day, soak, knowledge, try, hard, succeed, highly, motivate, new, thing, every, day, halfway, start, take, know, know, letter, sight, word, number, majority, letter, sound, hard, determination, excite, place, go, herei, currently, differentiate, sight, word, center, daily, literacy, station, activity, relate, whatever, sight, word, list, one, favorite, station, activity, continue, provide, way, practice, sight, word, dream, qr, reader, scan, sight, word, struggle, ipod, read, sight, word, give, multiple, ...]","[apple, ipod, nano, gb, mp, player, generation, late, model, blue, apple, ipod, nano, gb, mp, player, generation, late, model, silver]"


In [543]:
df33.to_csv('lemmatize_text.csv', index=False)  

**PROBLEM**: What happened to the data in the pandas dataframe?

ANSWER: Each text field was converted from one long string into a list of individual lemmatized words (tokens), with the stop words removed.
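The stop-word filter above took about a minute and a half for two columns. One likely contributor is that `stopWords` is a list, so every `not in` test scans it from the start. The sketch below shows the same filter with a set lookup. The word lists are made up for illustration and are not taken from the dataset.

```python
# Hypothetical stop-word list and token list, for illustration only
stop_words = ["the", "a", "learn", "classroom", "student"]
tokens = ["student", "learn", "letter", "the", "sight", "word"]

# A set gives constant-time membership tests; a list is scanned on every lookup
stop_set = set(stop_words)
filtered = [t for t in tokens if t not in stop_set]

print(filtered)  # ['letter', 'sight', 'word']
```

In the notebook, the same idea would mean building the set once before the loop and testing against it inside the `lambda`.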

# PART 4:  Make an LDA topic model for the ESSAYS.

**Note: Part 4 is worth 10 points (the value of 10 individual problems).**

Define an LDA topic model for the `essays`. Compute the "Coherence score." Visually inspect the topic model by inspecting the top keywords from each topic. Gensim provides functions for all of these tasks.

In [544]:

# Create Dictionary
id2word = corpora.Dictionary( df33['essays'])

# Create Corpus
texts =  df33['essays']
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View unique id for each word in the essay
print(corpus[:1])


[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 3), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 2), (39, 2), (40, 2), (41, 1), (42, 2), (43, 3), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 3), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 3), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 8), (67, 1), (68, 1), (69, 1), (70, 1), (71, 3), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 2), (81, 1), (82, 9)]]
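Each pair in the output above is `(word id, count)` for one document. The following is a minimal pure-Python sketch of that bag-of-words idea. It is a simplified stand-in written for illustration, not Gensim's implementation, and the helper names `build_vocab` and `to_bow` are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, document by document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the vocabulary."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

# Two tiny made-up documents, already tokenized
docs = [["sight", "word", "letter", "sight"], ["book", "read", "word"]]
token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]

print(token2id)  # {'letter': 0, 'sight': 1, 'word': 2, 'book': 3, 'read': 4}
print(bow)       # [[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1), (4, 1)]]
```

Word order is discarded: only which words occur, and how often, reaches the topic model.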


In [545]:
# Look up the word that a given id corresponds to (here id=0)
id2word[0]

'activity'

In [546]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('activity', 2),
  ('alongside', 1),
  ('always', 1),
  ('around', 1),
  ('atrisk', 1),
  ('basis', 1),
  ('cant', 1),
  ('center', 1),
  ('consider', 1),
  ('continue', 1),
  ('currently', 1),
  ('daily', 3),
  ('day', 3),
  ('determination', 1),
  ('differentiate', 1),
  ('distance', 1),
  ('dream', 1),
  ('english', 1),
  ('every', 1),
  ('everyone', 1),
  ('excite', 2),
  ('expose', 1),
  ('exposure', 1),
  ('favorite', 1),
  ('first', 1),
  ('flashcard', 1),
  ('front', 1),
  ('get', 1),
  ('give', 1),
  ('go', 2),
  ('halfway', 1),
  ('hard', 2),
  ('herei', 1),
  ('highly', 1),
  ('home', 1),
  ('house', 1),
  ('household', 1),
  ('ipod', 2),
  ('kid', 2),
  ('kindergarten', 2),
  ('know', 2),
  ('knowledge', 1),
  ('language', 2),
  ('letter', 3),
  ('list', 1),
  ('literacy', 1),
  ('lowincome', 1),
  ('majority', 1),
  ('motivate', 1),
  ('multiple', 1),
  ('never', 3),
  ('new', 1),
  ('number', 1),
  ('obstacle', 1),
  ('one', 1),
  ('parent', 1),
  ('place', 1),
  ('prac

In [547]:
#df.project_subject_categories.value_counts()

In [556]:
# I also tried 10 and 7 topics; the best coherence score was for 5
# Build the multicore LDA model
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics = 5, id2word=id2word)


Process ForkPoolWorker-262:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
[output truncated; the same traceback is interleaved for ForkPoolWorker-262 through ForkPoolWorker-268]
KeyboardInterrupt


In [549]:
lda_multicore_model.save('LDA_NYT_multicore8')

In [None]:
# Loading trained model
lda_multicore_model = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore8')


In [551]:
# Print the top keywords for each topic
print(lda_multicore_model.print_topics())
doc_lda = lda_multicore_model[corpus]

[(0, '0.009*"science" + 0.009*"project" + 0.009*"create" + 0.009*"art" + 0.008*"technology" + 0.007*"opportunity" + 0.006*"well" + 0.006*"world" + 0.006*"experience" + 0.006*"give"'), (1, '0.015*"read" + 0.010*"seat" + 0.009*"day" + 0.008*"sit" + 0.008*"allow" + 0.007*"chair" + 0.007*"move" + 0.007*"provide" + 0.007*"get" + 0.006*"focus"'), (2, '0.011*"skill" + 0.008*"provide" + 0.008*"material" + 0.008*"day" + 0.008*"math" + 0.007*"technology" + 0.007*"activity" + 0.007*"allow" + 0.006*"also" + 0.006*"new"'), (3, '0.024*"read" + 0.021*"book" + 0.007*"day" + 0.007*"get" + 0.007*"new" + 0.007*"home" + 0.006*"grade" + 0.006*"level" + 0.006*"child" + 0.006*"one"'), (4, '0.013*"technology" + 0.008*"math" + 0.008*"read" + 0.008*"skill" + 0.007*"also" + 0.007*"access" + 0.007*"computer" + 0.007*"grade" + 0.006*"one" + 0.006*"language"')]


In [552]:
# Compute Perplexity
#print('\nPerplexity: ', lda_multicore_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.30515799673452154
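To give a feel for what a coherence score measures, here is a simplified UMass-style calculation in pure Python. This is an assumption-laden sketch: the notebook uses Gensim's `c_v` measure, which is computed differently, and the function name `umass_coherence` and the toy documents are hypothetical. The idea it shares with `c_v` is that a topic scores higher when its top words tend to appear in the same documents.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Mean of log((D(w_hi, w_lo) + 1) / D(w_hi)) over ranked word pairs,
    where D counts the documents that contain the given word(s)."""
    doc_sets = [set(d) for d in docs]
    scores = []
    # combinations() keeps the input order, so `hi` is always ranked above `lo`
    for hi, lo in combinations(top_words, 2):
        d_hi = sum(1 for s in doc_sets if hi in s)
        d_both = sum(1 for s in doc_sets if hi in s and lo in s)
        if d_hi:  # skip words that never occur, to avoid dividing by zero
            scores.append(math.log((d_both + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy corpus (made up): two "reading" documents and two "seating" documents
docs = [
    ["read", "book", "library"],
    ["read", "book", "home"],
    ["chair", "seat", "move"],
    ["seat", "chair", "focus"],
]

coherent = umass_coherence(["read", "book"], docs)   # words that co-occur
mixed = umass_coherence(["read", "chair"], docs)     # words that never co-occur

print(round(coherent, 3))  # 0.405
print(round(mixed, 3))     # -0.693
```

A topic that mixes unrelated words, such as `read` with `seat` and `chair`, is pulled toward the lower score.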


If you use gensim and the following three variables, then you can visualize topics & keywords with the code below.

    lda_model:    this is an LDA model generated by gensim.models.ldamodel.LdaModel()
    id2word:      this is the dictionary that maps term IDs to words, from corpora.Dictionary()
    corpus:       this is the collection of "documents"


In [553]:
# Visualize topics-keywords
lda_model=lda_multicore_model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

# PART 5:  Make an LDA topic model for the DESCRIPTIONS.

**Note: Part 5 is worth 5 points (the value of 5 individual problems).**

Using the same K (and any other hyperparameters from Part 4), recompute a model for Descriptions. Compare the two sets of results. Do they vary? How? Why? Explain what you find. 

In [554]:
df3['description'].head()

0    [apple, ipod, nano, gb, mp, player, th, generation, late, model, blue, apple, ipod, nano, gb, mp, player, th, generation, late, model, silver]
1    [reebok, girl, fashion, dance, graphic, tshirt, dd, dark, heather, grey]                                                                      
2    [doodler, start, full, bundle]                                                                                                                
3    [ball, pg, poly, set, color, ball, playground, poly, set, kit, jumbo, gradestuff, pack, parachute, gripstarchute, recess, pack, grade, violet]
4    [crown, berkey, water, filter, black, pf, fluoride, filter]                                                                                   
Name: description, dtype: object

In [555]:

# Create Corpus
# Each description is already a list of tokens, so no .split() is needed
texts_description = df3['description']

# Create Dictionary
id2word_description = corpora.Dictionary(texts_description)

# Term Document Frequency
corpus_description = [id2word_description.doc2bow(text) for text in texts_description]

# View the (word id, count) pairs for the first description
print(corpus_description[:1])

In [None]:
# Time the training run
import time
start_time = time.time()

# Build Multicore LDA
lda_multicore_model_d = gensim.models.ldamulticore.LdaMulticore(corpus_description, num_topics = 5, id2word=id2word_description)
# Saving trained model
lda_multicore_model_d.save('LDA_NYT_multicore_d2')
# Loading trained model
lda_multicore_model_d = gensim.models.ldamodel.LdaModel.load('LDA_NYT_multicore_d2')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))




In [None]:
# Print the top keywords for each topic
print(lda_multicore_model_d.print_topics())
doc_lda_d = lda_multicore_model_d[corpus_description]

In [None]:
# Compute Perplexity
#print('\nPerplexity: ', lda_multicore_model_d.log_perplexity(corpus_description))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore_model_d, texts=texts_description, dictionary=id2word_description, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Visualize topics-keywords
lda_model=lda_multicore_model_d
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus_description, id2word_description)
vis
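One way to compare two sets of topics is to measure how much their top keywords overlap. The sketch below uses Jaccard similarity (shared words divided by all distinct words). As example input it takes the top-10 keywords of topics 1 and 3 from the essay model printed in Part 4. Keyword lists from the description model would be passed in the same way, for instance after extracting them with Gensim's `show_topic`. The helper name `jaccard` is hypothetical.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two keyword lists have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Top-10 keywords of essay topics 1 and 3, copied from the Part 4 output
topic_1 = ["read", "seat", "day", "sit", "allow",
           "chair", "move", "provide", "get", "focus"]
topic_3 = ["read", "book", "day", "get", "new",
           "home", "grade", "level", "child", "one"]

overlap = jaccard(topic_1, topic_3)
print(sorted(set(topic_1) & set(topic_3)))  # ['day', 'get', 'read']
print(round(overlap, 3))                    # 0.176
```

Values near 0 mean the two topics use mostly different vocabulary, and values near 1 mean they are nearly interchangeable.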

# PART 6:  Use TextHero and help to improve it.

**Note: This is worth 5 points (the value of 5 individual problems).**

[TextHero](https://texthero.org/) is an open-source project developed by a student from the [TIS Lab of Prof. Younge](https://www.epfl.ch/labs/tis). Go to the [GitHub repository for TextHero](https://github.com/jbesomi/texthero), install the package, review the documentation, and, if you are impressed by the package, give it a star and tell others! (Not required.)

Once you understand TextHero, use the package to re-implement major portions of Part 3 of this assignment that you completed above.

In [None]:

!pip install texthero



In [None]:
import importlib
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

In [None]:
import texthero as hero


In [None]:
data2=copy.copy(merged_df_textual)

In [None]:
#data2=data2.sample(n=100000, random_state=1)

In [None]:
data2.head(4)

In [None]:
from tqdm import tqdm

# Apply TextHero's default cleaning pipeline to every text column
for column in tqdm(data2.columns):
    data2[column] = data2[column].pipe(hero.clean)

In [None]:
data2.head()

In [None]:
# The kernel died on this cell every time, even after 5 attempts
"""data2['tfidf'] = (
    hero.tfidf(data2['essays']))"""

In [None]:
# The kernel died on this cell every time, even after 5 attempts
"""
data2['pca'] = hero.pca(data2['tfidf'])
hero.scatterplot(
    data2, 
    col='pca', 
    color='project_subject_categories', 
    title="PCA of DonorsChoose essays"
)"""

**Note: This is worth 5 points (the value of 5 individual problems).**

Open-source packages rely on the community of users to help them grow and improve. Review the [contributing file](https://github.com/jbesomi/texthero/blob/master/CONTRIBUTING.md) for TextHero and then identify a portion of the documentation that you feel could be improved. Edit/write 1 paragraph of documentation for the package that you believe would improve it. Copy that paragraph in below (into this notebook) so that it can be graded. And, if you think your contribution would truly help the project, please learn how to use git to suggest the change (a pull request) to the manager of the repository (Jonathan Besomi).

Documentation for nmf

texthero.representation.nmf

nmf(s, n_components=2)

    Perform non-negative matrix factorization.

Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. 
This factorization can be used for example for dimensionality reduction, source separation or topic extraction.
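Written out, the factorization described above looks as follows. The Frobenius-norm objective shown here is the most common choice for NMF; that TextHero's `nmf` minimizes exactly this objective is an assumption, not something stated in its documentation.

$$
X \approx W H, \qquad
X \in \mathbb{R}_{\ge 0}^{n \times m},\quad
W \in \mathbb{R}_{\ge 0}^{n \times k},\quad
H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\, \lVert X - W H \rVert_F^2
$$

Here $n$ is the number of documents, $m$ the number of terms in the TF-IDF matrix, and $k$ corresponds to `n_components`. Each row of $W$ is the reduced $k$-dimensional representation of one document.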

Parameters

    s: Pandas Series
    n_components: Int. Default is 2.
        Number of components to keep. If set to None, all components are kept.

Examples

import pandas as pd
import texthero as hero
from texthero import preprocessing

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]
s = hero.clean(s, custom_pipeline)
s_tf_idf = hero.tfidf(s)
s_nmf = hero.nmf(s_tf_idf)

Documentation for pca

Example:
    
import pandas as pd
import texthero as hero
from texthero import preprocessing

s = pd.Series(["Sentence one", "Sentence two"])

custom_pipeline = [preprocessing.lowercase,
                   preprocessing.remove_whitespace]

s = hero.clean(s, custom_pipeline)

s_tf_idf = hero.tfidf(s)

s_pca = hero.pca(s_tf_idf)
