## OVERVIEW OF OSEMiN

<img src='https://raw.githubusercontent.com/jirvingphd/fsds_100719_cohort_notes/master/images/OSEMN.png' width=800>

<center><a href="https://www.kdnuggets.com/2018/02/data-science-command-line-book-exploring-data.html"> 
    </a></center>


> <font size=2em>The Data Science Process we'll be using during this section--OSEMiN (pronounced "OH-sum", rhymes with "possum").  This is the most straightforward of the Data Science Processes discussed so far.  **Note that during this process, just like the others, the stages often blur together.***  It is completely acceptable (and ***often a best practice!) to float back and forth** between stages as you learn new things about your problem, dataset, requirements, etc.  
It's quite common to get to the modeling step and realize that you need to scrub your data a bit more or engineer a different feature and jump back to the "Scrub" stage, or go all the way back to the "Obtain" stage when you realize your current data isn't sufficient to solve this problem. 
As with any of these frameworks, *OSEMiN is meant to be treated as guidelines, not law. 
</font>


### OSEMN DETAILS

**OBTAIN**

- This step involves understanding stakeholder requirements, gathering information on the problem, and finally sourcing data that we think will be necessary for solving this problem. 

**SCRUB**

- During this stage, we'll focus on preprocessing our data.  Important steps such as identifying and removing null values, dealing with outliers, normalizing data, and feature engineering/feature selection are handled around this stage.  The line with this stage really blurs with the _Explore_ stage, as it is common to only realize that certain columns require cleaning or preprocessing as a result of the visualzations and explorations done during Step 3.  

- Note that although technically, categorical data should be one-hot encoded during this step, in practice, it's usually done after data exploration.  This is because it is much less time-consuming to visualize and explore a few columns containing categorical data than it is to explore many different dummy columns that have been one-hot encoded. 

**EXPLORE**

- This step focuses on getting to know the dataset you're working with. As mentioned above, this step tends to blend with the _Scrub_ step mentioned above.  During this step, you'll create visualizations to really get a feel for your dataset.  You'll focus on things such as understanding the distribution of different columns, checking for multicollinearity, and other tasks liek that.  If your project is a classification task, you may check the balance of the different classes in your dataset.  If your problem is a regression task, you may check that the dataset meets the assumptions necessary for a regression task.  

- At the end of this step, you should have a dataset ready for modeling that you've thoroughly explored and are extremely familiar with.  

**MODEL**

- This step, as with the last two frameworks, is also pretty self-explanatory. It consists of building and tuning models using all the tools you have in your data science toolbox.  In practice, this often means defining a threshold for success, selecting machine learning algorithms to test on the project, and tuning the ones that show promise to try and increase your results.  As with the other stages, it is both common and accepted to realize something, jump back to a previous stage like _Scrub_ or _Explore_, and make some changes to see how it affects the model.  

**iNTERPRET**

- During this step, you'll interpret the results of your model(s), and communicate results to stakeholders.  As with the other frameworks, communication is incredibily important! During this stage, you may come to realize that further investigation is needed, or more data.  That's totally fine--figure out what's needed, go get it, and start the process over! If your results are satisfactory to all stakeholders involved, you may also go from this stage right into productionizing your model and automating processes necessary to support it.  





## PROCESS CHECKLIST


> Keep in mind that it is normal to jump between the OSEMN phases and some of them will blend together, like SCRUB and EXPLORE.

1. **[OBTAIN](#OBTAIN)**
    - Import data, inspect, check for datatypes to convert and null values
    - Display header and info.
    - Drop any unneeded columns, if known (`df.drop(['col1','col2'],axis=1,inplace=True`)
    <br><br>


2. **[SCRUB](#SCRUB)**
    - Recast data types, identify outliers, check for multicollinearity, normalize data**
    - Check and cast data types
        - [ ] Check for #'s that are store as objects (`df.info()`,`df.describe()`)
            - when converting to #'s, look for odd values (like many 0's), or strings that can't be converted.
            - Decide how to deal weird/null values (`df.unique()`, `df.isna().sum()`)
            - `df.fillna(subset=['col_with_nulls'],'fill_value')`, `df.replace()`
        - [ ] Check for categorical variables stored as integers.
            - May be easier to tell when you make a scatter plotm or `pd.plotting.scatter_matrix()`
            
    - [ ] Check for missing values  (df.isna().sum())
        - Can drop rows or colums
        - For missing numeric data with median or bin/convert to categorical
        - For missing categorical data: make NaN own category OR replace with most common category
    - [ ] Check for multicollinearity
        - Use seaborn to make correlation matrix plot 
        - Good rule of thumb is anything over 0.75 corr is high, remove the variable that has the most correl with the largest # of variables
    - [ ] Normalize data (may want to do after some exploring)
        - Most popular is Z-scoring (but won't fix skew) 
        - Can log-transform to fix skewed data
    
    
3. **[EXPLORE](#EXPLORE)**
    - [ ] Check distributions, outliers, etc**
    - [ ] Check scales, ranges (df.describe())
    - [ ] Check histograms to get an idea of distributions (df.hist()) and data transformations to perform.
        - Can also do kernel density estimates
    - [ ] Use scatter plots to check for linearity and possible categorical variables (`df.plot("x","y")`)
        - categoricals will look like vertical lines
    - [ ] Use `pd.plotting.scatter_matrix(df)` to visualize possible relationships
    - [ ] Check for linearity.
   
   
4. **[MODEL](#MODEL)**

    - **Fit an initial model:** 
        - Run an initial model and get results

    - **Holdout validation / Train/test split**
        - use sklearn `train_test_split`
    
    
5. **[iNTERPRET](#iNTERPRET)**
    - **Assessing the model:**
        - Assess parameters (slope,intercept)
        - Check if the model explains the variation in the data (RMSE, F, R_square)
        - *Are the coeffs, slopes, intercepts in appropriate units?*
        - *Whats the impact of collinearity? Can we ignore?*
        <br><br>
    - **Revise the fitted model**
        - Multicollinearity is big issue for lin regression and cannot fully remove it
        - Use the predictive ability of model to test it (like R2 and RMSE)
        - Check for missed non-linearity
        
       
6. **Interpret final model and draw >=3 conclusions and recommendations from dataset**

<div style="display:block;border-bottom:solid red 3px;padding:1.4em;color:red;font-size:30pt;display:inline-block;line-height:1.5em;">
DELETE THIS CELL AND EVERYTHING ABOVE FROM YOUR FINAL NOTEBOOK
</div>

# Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time:
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:
* Video of 5-min Non-Technical Presentation:

## TABLE OF CONTENTS 

*Click to jump to matching Markdown Header.*<br><br>
 
- **[Introduction](#INTRODUCTION)<br>**
- **[OBTAIN](#OBTAIN)**<br>
- **[SCRUB](#SCRUB)**<br>
- **[EXPLORE](#EXPLORE)**<br>
- **[MODEL](#MODEL)**<br>
- **[iNTERPRET](#iNTERPRET)**<br>
- **[Conclusions/Recommendations](#CONCLUSIONS-&-RECOMMENDATIONS)<br>**
___

# INTRODUCTION

## Business Problem

Social media presence is an important part of a modern brand's marketing strategy. But these platforms not only allow brands to broadcast their messages directly to consumers, they also allow consumers to voice their feedback and opinions about the brands candidly, and in a public forum. This translates to a responsibility on one hand, but also an opportunity on the other hand, for companies to listen to and respond to feedback from customers.

Many companies use the Net Promoter Score (NPS) as a measure of customer satisfaction and loyalty. However, NPS survey response rates can be low (15%-20% is considered decent) and non-response bias makes the resulting scores unreliable. NPS is also usually just a single question asking how likely a customer is to recommend the company's product to someone else; they may not provide the mechanism for the respondent to give specific feedback about what lead to their answer. Analysis of other channels where customers provide feedback, such as Twitter, could supplement frequently sparse NPS data.

If technology could take a first pass on determining the sentiment of tweets, large companies would have a better chance at winnowing constructive, actionable feedback from trolling or irrelevant comments.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***



# OBTAIN

## Data Understanding

This data comes from [CrowdFlower](https://data.world/crowdflower/brands-and-product-emotions). It consists of a corpus of tweets which humans were asked to label according to whether they were related to a particular brand or product, and whether a positive, negative, or no emotion was expressed. Tweets about brands or products were labeled with the specific brand or product.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***


In [398]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import html
import string
from chardet.universaldetector import UniversalDetector

import nltk
from nltk.probability import FreqDist
from nltk.tokenize import TweetTokenizer, word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords

from wordcloud import WordCloud

import spacy

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer,\
        CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

%matplotlib inline

ModuleNotFoundError: No module named 'wordcloud'

In [347]:
#nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jessicamiles/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [348]:
pd.set_option("display.max_colwidth",150)

In [349]:
# try to detect chatacter encoding of the file
detector = UniversalDetector()

for line in open('data/judge-1377884607_tweet_product_company.csv', 'r+b').readlines():
    #print(line)
    detector.feed(line)
    if detector.done: break
        
detector.close()
print(detector.result)


{'encoding': 'Windows-1254', 'confidence': 0.43036719349968755, 'language': 'Turkish'}


That wasn't especially helpful. I tried cp1254 codec and it was not successful.

In [350]:
# read in data. Had to switch to latin_1 encoding because encountered errors
# with default UTF-8 and windows 1254

df = pd.read_csv('data/judge-1377884607_tweet_product_company.csv',
                encoding='latin_1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion


In [351]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [352]:
# check out the one null in the tweet_text column
df.loc[df['tweet_text'].isna()]

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
6,,,No emotion toward brand or product


In [353]:
# verified it's blank in the source CSV as well. Going to drop it.
df.dropna(subset=['tweet_text'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


In [354]:
# rename the columns to be less verbose
col_dict = {'emotion_in_tweet_is_directed_at':'product',
           'is_there_an_emotion_directed_at_a_brand_or_product':'emotion'}
df.rename(columns=col_dict, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_text  9092 non-null   object
 1   product     3291 non-null   object
 2   emotion     9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


In [355]:
# what are the values in the product column? How do they match up to emotions?
df.groupby(by=['emotion', 'product'], dropna=False).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,tweet_text
emotion,product,Unnamed: 2_level_1
I can't tell,Apple,2
I can't tell,Google,1
I can't tell,Other Google product or service,1
I can't tell,iPad,4
I can't tell,iPhone,1
I can't tell,,147
Negative emotion,Android,8
Negative emotion,Android App,8
Negative emotion,Apple,95
Negative emotion,Google,68


In [356]:
# let's get a sample of "I can't tell"
df[df['emotion']=="I can't tell"]['tweet_text'].values

array(['Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh',
       '\x89ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow',
       'Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw',
       'The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw',
       "Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link}",
       'SYD #SXSW crew your iPhone extra juice pods have been procured.',
       'Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}',
       'Gave into extreme temptation at #SXSW and bought an iPad 2... #impulse',
       'Catch 22\x89Û_ I mean iPad 2 at #SXSW : {link}',
       'Forgot my iPhone for #sxsw. Android only. Knife to a gun fight',
       'Kawasaki: k

I'm going to put all of the tweets through tokenization and preprocessing. I will train my model on the tweets which are labeled positive or negative emotions, but I might want to use the "no emotion" and "I can't tell" labeled tweets to help confirm the results.

# SCRUB

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [357]:
# Make a copy of the tweet text, so I can keep the original pristine
df['cleaned'] = df['tweet_text']
df['cleaned'].head()

0                .@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.
1    @jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW
2                                                                @swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.
3                                                             @sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw
4            @sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)
Name: cleaned, dtype: object

## Clean text as documents

Before splitting into words, I'm going to do some text cleaning on the whole documents.

I noticed when taking an initial look at the corpus that I have some existing placeholders which were probably added when anonymizing this dataset for the public:
- @mention
- {link}

I'm going to remove the links because as words they won't be especially helpful, but I do want to keep track of tweets that had links, since that might be an interesting feature to examine.

Most of the @mentions appear to be placeholders, but not all are, some are the original handles. I think it's worth capturing the mentions in a column as well as hashtags, so these can be analyzed separately. I'll do the preprocessing in a specific order, so as to replace some of these and capture the information I want before replacing the rest. 

I also have some characters with encoding that I can't replicate; they could be emojis but in some instances they look to be unicode apostrophes and quotation marks. I'll replace these, since I can't find any way to easily get them encoded correctly.

Finally, I did also notice that some people put Twitter abbreviations such as "RT" for retweet. I may engineer a feature indicating whether a tweet is a retweet, and remove that text.

In [358]:
# check for literal (unescaped) open or closing HTML tags
df[df['tweet_text'].str.contains("[<>]")]

Unnamed: 0,tweet_text,product,emotion,cleaned


Since there are no "<>" characters in any of the tweets, I will not add anything to remove HTML tags from the text when I clean it. 

In [359]:
# check for links or URLs
df[df['tweet_text'].str.contains("http[^ ]+|www\.[^ ]+")]

Unnamed: 0,tweet_text,product,emotion,cleaned
5,@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd,,No emotion toward brand or product,@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd
8,Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB,iPad or iPhone App,Positive emotion,Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB
11,Find &amp; Start Impromptu Parties at #SXSW With @HurricaneParty http://bit.ly/gVLrIn I can't wait til the Android app comes out.,Android App,Positive emotion,Find &amp; Start Impromptu Parties at #SXSW With @HurricaneParty http://bit.ly/gVLrIn I can't wait til the Android app comes out.
12,"Foursquare ups the game, just in time for #SXSW http://j.mp/grN7pK) - Still prefer @Gowalla by far, best looking Android app to date.",Android App,Positive emotion,"Foursquare ups the game, just in time for #SXSW http://j.mp/grN7pK) - Still prefer @Gowalla by far, best looking Android app to date."
13,Gotta love this #SXSW Google Calendar featuring top parties/ show cases to check out. RT @hamsandwich via @ischafer =&gt;http://bit.ly/aXZwxB,Other Google product or service,Positive emotion,Gotta love this #SXSW Google Calendar featuring top parties/ show cases to check out. RT @hamsandwich via @ischafer =&gt;http://bit.ly/aXZwxB
14,Great #sxsw ipad app from @madebymany: http://tinyurl.com/4nqv92l,iPad or iPhone App,Positive emotion,Great #sxsw ipad app from @madebymany: http://tinyurl.com/4nqv92l
15,"haha, awesomely rad iPad app by @madebymany http://bit.ly/hTdFim #hollergram #sxsw",iPad or iPhone App,Positive emotion,"haha, awesomely rad iPad app by @madebymany http://bit.ly/hTdFim #hollergram #sxsw"
16,Holler Gram for iPad on the iTunes App Store - http://t.co/kfN3f5Q (via @marc_is_ken) #sxsw,,No emotion toward brand or product,Holler Gram for iPad on the iTunes App Store - http://t.co/kfN3f5Q (via @marc_is_ken) #sxsw
19,Must have #SXSW app! RT @malbonster: Lovely review from Forbes for our SXSW iPad app Holler Gram - http://t.co/g4GZypV,iPad or iPhone App,Positive emotion,Must have #SXSW app! RT @malbonster: Lovely review from Forbes for our SXSW iPad app Holler Gram - http://t.co/g4GZypV
23,"Photo: Just installed the #SXSW iPhone app, which is really nice! http://tumblr.com/x6t1pi6av7",iPad or iPhone App,Positive emotion,"Photo: Just installed the #SXSW iPhone app, which is really nice! http://tumblr.com/x6t1pi6av7"


I definitely do have URLs, which I will remove.

In [360]:
# check for non-ASCII characters
# regex from https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters

df[df['tweet_text'].str.contains('[^\x00-\x7F]+')]

Unnamed: 0,tweet_text,product,emotion,cleaned
38,@mention - False Alarm: Google Circles Not Coming NowÛÒand Probably Not Ever? - {link} #Google #Circles #Social #SXSW,Google,Negative emotion,@mention - False Alarm: Google Circles Not Coming NowÛÒand Probably Not Ever? - {link} #Google #Circles #Social #SXSW
41,"HootSuite - HootSuite Mobile for #SXSW ~ Updates for iPhone, BlackBerry &amp; Android: Whether youÛªre getting friend... {link}",,No emotion toward brand or product,"HootSuite - HootSuite Mobile for #SXSW ~ Updates for iPhone, BlackBerry &amp; Android: Whether youÛªre getting friend... {link}"
42,Hey #SXSW - How long do you think it takes us to make an iPhone case? answer @mention using #zazzlesxsw and weÛªll make you one!,,No emotion toward brand or product,Hey #SXSW - How long do you think it takes us to make an iPhone case? answer @mention using #zazzlesxsw and weÛªll make you one!
45,#IPad2 's Û÷#SmartCoverÛª Opens to Instant Access - I should have waited to get one! - {link} #apple #SXSW,iPad or iPhone App,Positive emotion,#IPad2 's Û÷#SmartCoverÛª Opens to Instant Access - I should have waited to get one! - {link} #apple #SXSW
46,Hand-Held Û÷HoboÛª: Drafthouse launches Û÷Hobo With a ShotgunÛª iPhone app #SXSW {link},,Positive emotion,Hand-Held Û÷HoboÛª: Drafthouse launches Û÷Hobo With a ShotgunÛª iPhone app #SXSW {link}
...,...,...,...,...
8925,umm that would be @mention ÛÏ@mention I keep winning shit! Thanks @mention for the killer iPad case. #sxswÛ,Other Apple product or service,Positive emotion,umm that would be @mention ÛÏ@mention I keep winning shit! Thanks @mention for the killer iPad case. #sxswÛ
8945,FestivalExplorer iPhone App Finally Solves SXSW {link} #music #musica #musiek #musique #musik #app #sxsw #Ù_¾¬â #Ù_¾´_ #Î¥É,iPad or iPhone App,Positive emotion,FestivalExplorer iPhone App Finally Solves SXSW {link} #music #musica #musiek #musique #musik #app #sxsw #Ù_¾¬â #Ù_¾´_ #Î¥É
8963,"Group #Texting War Heats Up: Fast Society Launches New Android App, Updates iPhone App: #SXSWÛ_ {link}",Android App,Positive emotion,"Group #Texting War Heats Up: Fast Society Launches New Android App, Updates iPhone App: #SXSWÛ_ {link}"
8982,"In case my fairy god mother = reading mail; my ÌÙ±G wish this week is 2 go 2 #sxsw Ï for the #Android ÏÎ Dev Ïà Meetup. @mention Hilton, Sat....",,No emotion toward brand or product,"In case my fairy god mother = reading mail; my ÌÙ±G wish this week is 2 go 2 #sxsw Ï for the #Android ÏÎ Dev Ïà Meetup. @mention Hilton, Sat...."


In [361]:
doc = df.at[37, 'tweet_text']
doc = doc.encode('ascii', 'ignore').decode()
doc

'SPIN Play - a new concept in music discovery for your iPad from @mention &amp; spin.com {link} #iTunes #sxsw @mention'

The problem I have with the non-ASCII characters is that they often seem to represent apostrophes and quotation marks. I can't find an encoding that will show them properly in python, or in a SublimeText when I open the raw file. 

For now I'm going to replace the non-ASCII characters with spaces because I think replacing them with nothing will lead to weird words.

In [362]:
def clean_docs1(doc):
    """
    """
    # unescape HTML characters
    doc = html.unescape(doc)
    
    # remove URLs and links, replacing them with existing placeholder
    urls = re.findall("http[^ ]+|www\.[^ ]+", doc)
    for url in urls:
        doc = str.replace(doc, url, '{link}')
    
    # replace non-ASCII characters with space
    doc = re.sub(r"[^\x00-\x7F]+", ' ', doc)
    
    # replace ASCII control characters with space
    doc = re.sub(r"[\x00-\x1F]", ' ', doc)
    
    # remove multiple spaces, which will exist after all this replacing words
    doc = re.sub(r"[ ]{2,}", ' ', doc)
    
    return doc

In [363]:
# map doc cleaning function onto 'cleaned' column
df['cleaned'] = df['cleaned'].map(lambda x: clean_docs1(x))
df

Unnamed: 0,tweet_text,product,emotion,cleaned
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW."
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW"
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)"
...,...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion,Ipad everywhere. #SXSW {link}
9089,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles",,No emotion toward brand or product,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles"
9090,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev",,No emotion toward brand or product,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data."" #sxsw #health2dev"
9091,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,,No emotion toward brand or product,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.


In [364]:
# specifically check on a few rows
df.loc[8982]
df.loc[7350]

tweet_text    Saw a company today ready to launch, sounds a lot like Google Circles, but with actual personal privacy www.mycube.com #sxsw
product                                                                                                    Other Google product or service
emotion                                                                                                 No emotion toward brand or product
cleaned               Saw a company today ready to launch, sounds a lot like Google Circles, but with actual personal privacy {link} #sxsw
Name: 7350, dtype: object

In [365]:
doc = df['cleaned'].loc[1]
doc

"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW"

In [366]:
def get_pattern_hits(doc, pattern, out_type):
    """
    
    """
    
    # determine the variable type for recording hits
    if out_type=='list': 
        hits = []
    elif out_type=='bool':
        hits = False
    elif out_type=='none':
        hits = None
        
    # search for regex pattern in doc
    pattern_hits = re.findall(pattern, doc)

    if len(pattern_hits) > 0:
        # replace the hits in the original doc string
        # need to use the re version otherwise substrings won't be replaced 
        # correctly!
        doc = re.sub(pattern, ' ', doc)

        # replace multiple spaces with a single space
        doc = re.sub(r"(\s{2,})", ' ', doc)

        # Update appropriate hits variable
        for hit in pattern_hits:
            if out_type=='list':
                hits.append(hit)
            elif out_type=='bool':
                hits = True

        if out_type=='list':
            hits = list(set(hits))
                 
    return doc, hits

In [367]:
def log_remove(df, doc_col, hit_col, pattern, out_type='list'):
    """
    """
    updates = []
    
    # loop through each row in the dataframe to process its record
    for i in df.index:
        new_doc, hits = get_pattern_hits(df.at[i, doc_col], pattern, out_type)
        updates.append([new_doc, hits])
    
    df_new = pd.DataFrame(updates, columns=[doc_col, hit_col])
    
    if out_type=='none':
        df_new.drop(columns=[hit_col], inplace=True)
    
    df = df.join(df_new, lsuffix='_old', how='inner')
    df.drop(columns=[f"{doc_col}_old"], inplace=True)
    return df


In [368]:
# clean and log @mentions
df.reset_index(inplace=True, drop=True)
df = log_remove(df, doc_col='cleaned', hit_col='mentions', 
                pattern=r"(?:^|\s)(@[a-zA-Z0-9_-]+)")
df[['tweet_text', 'cleaned', 'mentions']]

Unnamed: 0,tweet_text,cleaned,mentions
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",[]
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","Know about ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","[@fludapp, @jessedee]"
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for #iPad 2 also. They should sale them down at #SXSW.,[@swonderlin]
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,[@sxsw]
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)","great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)",[@sxtxstate]
...,...,...,...
9087,Ipad everywhere. #SXSW {link},Ipad everywhere. #SXSW {link},[]
9088,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles","Wave, buzz... RT We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles",[@mention]
9089,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev","Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data."" #sxsw #health2dev",[]
9090,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,[]


In [369]:
# clean and remove hashtags
df = log_remove(df, 'cleaned', 'hashtags', 
                pattern=r"(?:^|\s)(#[a-zA-Z0-9_-]+)")
df[['tweet_text', 'cleaned', 'hashtags']]

Unnamed: 0,tweet_text,cleaned,hashtags
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at , it was dead! I need to upgrade. Plugin stations at .","[#SXSW, #RISE_Austin]"
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","Know about ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at",[#SXSW]
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for 2 also. They should sale them down at .,"[#SXSW, #iPad]"
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,I hope this year's festival isn't as crashy as this year's iPhone app.,[#sxsw]
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)","great stuff on Fri : Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)",[#SXSW]
...,...,...,...
9087,Ipad everywhere. #SXSW {link},Ipad everywhere. {link},[#SXSW]
9088,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles","Wave, buzz... RT We interrupt your regularly scheduled geek programming with big news {link}","[#google, #circles, #sxsw]"
9089,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev","Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data.""","[#health2dev, #sxsw]"
9090,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended .,[#SXSW]


In [370]:
# clean and remove {link} placeholders
df = log_remove(df, 'cleaned', 'links', pattern=r"(?:^|\s)(\{link\})", 
                out_type='bool')
df[['tweet_text', 'cleaned', 'links']]

Unnamed: 0,tweet_text,cleaned,links
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at , it was dead! I need to upgrade. Plugin stations at .",False
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","Know about ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at",False
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for 2 also. They should sale them down at .,False
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,I hope this year's festival isn't as crashy as this year's iPhone app.,False
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)","great stuff on Fri : Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)",False
...,...,...,...
9087,Ipad everywhere. #SXSW {link},Ipad everywhere.,True
9088,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles","Wave, buzz... RT We interrupt your regularly scheduled geek programming with big news",True
9089,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev","Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data.""",False
9090,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended .,False


In [371]:
# clean and remove RT placeholders
df = log_remove(df, 'cleaned', 'RT', pattern=r"(?:^|\s)\b(RT)\b", out_type='bool')
df[['tweet_text', 'cleaned', 'RT']]

Unnamed: 0,tweet_text,cleaned,RT
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at , it was dead! I need to upgrade. Plugin stations at .",False
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","Know about ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at",False
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for 2 also. They should sale them down at .,False
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,I hope this year's festival isn't as crashy as this year's iPhone app.,False
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)","great stuff on Fri : Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)",False
...,...,...,...
9087,Ipad everywhere. #SXSW {link},Ipad everywhere.,False
9088,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles","Wave, buzz... We interrupt your regularly scheduled geek programming with big news",True
9089,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev","Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data.""",False
9090,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended .,False


In [372]:
# remove numbers
df = log_remove(df, 'cleaned', 'none', 
                pattern=r"(?:^|\s)([.:$%]*[0-9]+[.:$%]*[0-9]*)\b", 
                out_type='none')
df[['tweet_text', 'cleaned']]

Unnamed: 0,tweet_text,cleaned
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After hrs tweeting at , it was dead! I need to upgrade. Plugin stations at ."
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW","Know about ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at"
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for also. They should sale them down at .
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,I hope this year's festival isn't as crashy as this year's iPhone app.
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)","great stuff on Fri : Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)"
...,...,...
9087,Ipad everywhere. #SXSW {link},Ipad everywhere.
9088,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles","Wave, buzz... We interrupt your regularly scheduled geek programming with big news"
9089,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev","Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. ""We're operating w/out data."""
9090,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended .


In [373]:
# I don't see any numbers in the initial cells, so let's find some
df.loc[df['tweet_text'].str.contains(r"(?:^|\s)([.:$%]*[0-9]+[.:$%]*[0-9]*)\s"), 
       ['tweet_text', 'cleaned']]

  return func(self, *args, **kwargs)


Unnamed: 0,tweet_text,cleaned
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",".@wesley83 I have a 3G iPhone. After hrs tweeting at , it was dead! I need to upgrade. Plugin stations at ."
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Can not wait for also. They should sale them down at .
23,Really enjoying the changes in Gowalla 3.0 for Android! Looking forward to seeing what else they &amp; Foursquare have up their sleeves at #SXSW,Really enjoying the changes in Gowalla for Android! Looking forward to seeing what else they & Foursquare have up their sleeves at
28,"They were right, the @gowalla 3 app on #android is sweeeeet! Nice job by the team there. #sxsw","They were right, the app on is sweeeeet! Nice job by the team there."
42,Mashable! - The iPad 2 Takes Over SXSW [VIDEO] #ipad #sxsw #gadgets {link},Mashable! - The iPad Takes Over SXSW [VIDEO]
...,...,...
9051,@mention You could buy a new iPad 2 tmrw at the Apple pop-up store at #sxsw: {link},You could buy a new iPad tmrw at the Apple pop-up store at :
9053,"Guys, if you ever plan on attending #SXSW, you need 4 things, skinny jeans, flannel shirt, beard and an iPad #imanoutcast...","Guys, if you ever plan on attending , you need things, skinny jeans, flannel shirt, beard and an iPad ..."
9062,@mention You should get the iPad 2 to save your back from lugging the laptop #SXSW #SXSWMyMistake,You should get the iPad to save your back from lugging the laptop
9071,@mention your iPhone 4 cases are Rad and Ready! Stop by tomorrow to get them! #Sxsw #zazzlesxsw #sxswi {link},your iPhone cases are Rad and Ready! Stop by tomorrow to get them!


In [374]:
# check for standalone 2's, which I found after processing a little further.
# needed to adjust the regex to accomodate numbers at the end of the doc
# or end of a sentence
df[df['cleaned'].str.contains(r"2")]

Unnamed: 0,tweet_text,product,emotion,mentions,hashtags,links,RT,cleaned
19,Need to buy an iPad2 while I'm in Austin at #sxsw. Not sure if I'll need to Q up at an Austin Apple store?,iPad,Positive emotion,[],[#sxsw],False,False,Need to buy an iPad2 while I'm in Austin at . Not sure if I'll need to Q up at an Austin Apple store?
39,@mention - Great weather to greet you for #sxsw! Still need a sweater at night..Apple putting up &quot;flash store&quot; downtown to sell iPad2,Apple,Positive emotion,[@mention],[#sxsw],False,False,"- Great weather to greet you for ! Still need a sweater at night..Apple putting up ""flash store"" downtown to sell iPad2"
77,"iPad2? RT @mention Droid &amp; Mac here :) RT @mention My #agnerd confession, using laptop, iPad &amp; blackberry to follow #SXSW",,No emotion toward brand or product,[@mention],"[#SXSW, #agnerd]",False,True,"iPad2? Droid & Mac here :) My confession, using laptop, iPad & blackberry to follow"
146,#fastball #sxsw Giving away two NEW Ipad2 wifi 32g black Apple cover tweet @mention fo more info #sxswi #attsxsw Tonight @mention bo.lt house,,No emotion toward brand or product,[@mention],"[#fastball, #sxswi, #sxsw, #attsxsw]",False,False,Giving away two NEW Ipad2 wifi 32g black Apple cover tweet fo more info Tonight bo.lt house
171,ipad2 and #sxsw...a conflagration of doofusness. {link},iPad,Negative emotion,[],[#sxsw],True,False,ipad2 and ...a conflagration of doofusness.
...,...,...,...,...,...,...,...,...
8959,#japan #SXSW put you collective entrepreneurial and social minds and iPad 2s together and do something for Japan,,No emotion toward brand or product,[],"[#SXSW, #japan]",False,False,put you collective entrepreneurial and social minds and iPad 2s together and do something for Japan
8995,Getting my ipad2 #sxsw (@mention Apple Store w/ 4 others) {link},iPad,Positive emotion,[],[#sxsw],True,False,Getting my ipad2 (@mention Apple Store w/ others)
9017,Second day using my Apple iPad2 at #SXSW and I'm really impressed. The magnetic cover is pure brilliance. Using a laptop is so old school.,iPad,Positive emotion,[],[#SXSW],False,False,Second day using my Apple iPad2 at and I'm really impressed. The magnetic cover is pure brilliance. Using a laptop is so old school.
9057,&quot;Do you know what Apple is really good at? Making you feel bad about your Xmas present!&quot; - Seth Meyers on iPad2 #sxsw #doyoureallyneedthat?,,I can't tell,[],"[#doyoureallyneedthat, #sxsw]",False,False,"""Do you know what Apple is really good at? Making you feel bad about your Xmas present!"" - Seth Meyers on iPad2 ?"


In [206]:
doc = "Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles"

# pattern=r"[\x01-\x1F]"
# pattern_hits = re.findall(pattern, doc)

# # replace the hits in the original doc string
# for hit in pattern_hits:
#     print(hit)
    
# replace all instances
doc = re.sub(r"(?:^|\s)(@[a-zA-Z0-9_-]+)", " ", doc, )
print(doc)

Wave, buzz... RT  We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles


# EXPLORE

## Create tokenized corpora for visualization

I'm going to write a function to remove the stopwords (and punctuation), as well as tokenize using lemmatization or not. This will allow me to use it ad-hoc to generate tokens for the entire corpus for EDA before modeling, and also use it in an sklearn pipeline so I can apply the same logic when pre-processing for modeling.

I concerned that some of the stop words in the default nltk list, such as negations (don't, won't, shouldn't, can't) might more commonly contribute to negative emotion than positive ones. I want to investigate the list of stopwords and check out what I might want to customize. I may try a few different iterations.

In [285]:
nltk_stopwords = stopwords.words('english')
nltk_stopwords.sort()
print(nltk_stopwords)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [282]:
# my much-pared-down stopwords list for testing
custom_stopwords = ['a',
'an',
'and',
'as',
'at',
'be',
'by',
'for',
'from',
'if',
'in',
'into',
'it',
"it's",
'its',
'itself',
'of',
'on',
'or',
'than',
'that',
"that'll",
'the',
'to']

In [300]:
# create full and custom punctuation list. Custom excludes ! and ?
punc = list(string.punctuation)
punc_custom = punc.copy()
punc_custom.remove('?')
punc_custom.remove('!')

print(punc)
print(punc_custom)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
['"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [290]:
# create a list of all the emotions
all_emotions = list(df['emotion'].value_counts().index)
all_emotions

['No emotion toward brand or product',
 'Positive emotion',
 'Negative emotion',
 "I can't tell"]

In [325]:
# running this outside the function to avoid re-instantiating it each time,
# which takes longer
nlp = spacy.load("en_core_web_sm")

In [403]:
def spacy_tokenizer(doc, stop_list=None, lemmatize=True, orig_pronouns=True):
    """
    Tokenizes a string of text (document) with optional stop word removal
    and lemmatization from SpaCy. Add punctuation to `stop_list` to remove it 
    too, otherwise it will be retained as separate words based on how SpaCy
    parses it (some may remain attached to words).
    
    Always lowercases text.
    
    Returns a list of tokenized words, lowercased, lemmatized and with stopwords
    removed as indicated.
    
    ******
    Arguments:
    ******
    
    doc: String representing a document to be tokenized.
    
    stop_list: List of stop words to remove, or `None` to not remove any words.
    Add punctuation to this list to remove it.
    
    lemmatize: Boolean, default True. Set to False to use original versions of
    words instead of lemmas.
    
    orig_pronouns: Boolean, default True. Only applicable if you are lemmatizing.
    If True, use the original version of personal pronouns such as "I", "you", 
    "your" instead of the `-PRON-` placeholder that SpaCy adds to standardize 
    these. If False, include the `-PRON-` placeholder as the literal text token.
    """
    
    # use SpaCy to tokenize the doc
    tokens = nlp(doc)
    
    new_tokens = []
    
    for token in tokens:
        new_token = ''
        
        # determine appropriate version of token to use
        if lemmatize:
            
            if token.lemma_ == '-PRON-':
                if orig_pronouns:
                    new_token = str.lower(token.text)
                else:
                    new_token = str.lower(token.lemma_)
            else:
                new_token = str.lower(token.lemma_)
        else:
            new_token = str.lower(token.text)
        
        # check token against stop word list, if using
        if not stop_list == None:
            if new_token not in stop_list:
                new_tokens.append(new_token)
            
        else:
            new_tokens.append(new_token)         

    return new_tokens

In [319]:
def tokenized_corpus_dict(df, target_vals, stop_list, lemmatize, 
                          orig_pronouns, verbose=True):
    """
    """

    # generate corpus for each emotion
    corpus_per_target = {}

    for val in target_vals:
        
        if verbose:
            print(f"Starting target val: {val}")

        # get series of text docs per target_val
        docs = df.loc[df['emotion']==val, 'cleaned']

        # loop through docs and tokenize each one
        corpus = []

        i = 0
        for doc in docs:
            tokens = spacy_tokenizer(doc, stop_list=stop_list, 
                              lemmatize=lemmatize, orig_pronouns=True)
            # remove words if they're just spaces!
            tokens.remove(' ') if ' ' in tokens else None
            
            corpus.extend(tokens)
            i += 1
            
            if verbose and (i % 1000 == 0):
                print(f"Processed {i} docs out of {len(docs)}...")

        # add corpus to dict
        corpus_per_target[val] = corpus
        
    if verbose:
        print(f"Done!")
    return corpus_per_target


In [320]:
# remove no stopwords except most punctuation, don't lemmatize
min_processing = tokenized_corpus_dict(df, all_emotions, stop_list=punc_custom, 
                        lemmatize=False, orig_pronouns=True)

Starting target val: No emotion toward brand or product
Processed 1000 docs out of 5388...
Processed 2000 docs out of 5388...
Processed 3000 docs out of 5388...
Processed 4000 docs out of 5388...
Processed 5000 docs out of 5388...
Starting target val: Positive emotion
Processed 1000 docs out of 2978...
Processed 2000 docs out of 2978...
Starting target val: Negative emotion
Starting target val: I can't tell
Done!


Let's check out the results and make sure things look good. I had to go back and add a line to remove words that are just spaces.

In [316]:
df.loc[df['emotion']=="No emotion toward brand or product", 'cleaned'][:5].values

array([' New iPad Apps For And Communication Are Showcased At The Conference ',
       'Holler Gram for iPad on the iTunes App Store - (via ) ',
       'Attn: All frineds, Register for and see Cobra iRadar for Android. ',
       'Anyone at want to sell their old iPad?',
       'Anyone at who bought the new iPad want to sell their older iPad to me?'],
      dtype=object)

In [322]:
print(min_processing["No emotion toward brand or product"][:100])

['new', 'ipad', 'apps', 'for', 'and', 'communication', 'are', 'showcased', 'at', 'the', 'conference', 'holler', 'gram', 'for', 'ipad', 'on', 'the', 'itunes', 'app', 'store', 'via', 'attn', 'all', 'frineds', 'register', 'for', 'and', 'see', 'cobra', 'iradar', 'for', 'android', 'anyone', 'at', 'want', 'to', 'sell', 'their', 'old', 'ipad', '?', 'anyone', 'at', 'who', 'bought', 'the', 'new', 'ipad', 'want', 'to', 'sell', 'their', 'older', 'ipad', 'to', 'me', '?', 'at', 'oooh', 'google', 'to', 'launch', 'major', 'new', 'social', 'network', 'called', 'circles', 'possibly', 'today', 'spin', 'play', 'a', 'new', 'concept', 'in', 'music', 'discovery', 'for', 'your', 'ipad', 'from', 'spin.com', 'vatornews', 'google', 'and', 'apple', 'force', 'print', 'media', 'to', 'evolve', '?', 'hootsuite', 'hootsuite', 'mobile', 'for', 'updates', 'for', 'iphone']


In [324]:
# check out most common words
pos_min = FreqDist(min_processing['Positive emotion'])


freq_df = pd.DataFrame(pos_min.most_common(50),columns=['Word','Count'])
freq_df

Unnamed: 0,Word,Count
0,the,1597
1,!,1250
2,to,1161
3,at,1013
4,ipad,932
5,for,909
6,a,789
7,apple,755
8,google,663
9,is,654


In this minimally processed, tokenized corpus, I expect to still see stop words, ? and !, but definitely looking for other words or symbols that stand out, which I want to add to the stop words list.

Product names such as `apple`, `ipad`, `android`, and `google` stand out, as well as `austin` and `sxsw`, since I think these tweets were taken from a set where the SXSW festival in Austin was tagged. 

Also the ellipsis, which I realize wasn't in the punctuation list.

I'm going to add the ellipsis to the punctuation lists, and create another stop words list for products that I can test removing or leaving in.

In [336]:
# add ellipsis to punctuatino list to be excluded. I actually think it's
# being processed as a word by SpaCy, but since I'm removing both punctuation 
# and stop words at once, it shouldn't matter which list I use
punc.append("...")
punc_custom.append("...")

# create additional stopword lists related to the specific product and event
# so they can be removed separately to test results
product_stopwords = ['ipad', 'apple', 'google', 'iphone', 'android', 'ipad2']
event_stopwords = ['austin', 'sxsw']


In [375]:
# Let's try this again, with the updates! I'm just going to use the updated
# punc lists on this minimally processed version

min_processing = tokenized_corpus_dict(df, all_emotions, stop_list=punc_custom, 
                        lemmatize=False, orig_pronouns=True)

Starting target val: No emotion toward brand or product
Processed 1000 docs out of 5388...
Processed 2000 docs out of 5388...
Processed 3000 docs out of 5388...
Processed 4000 docs out of 5388...
Processed 5000 docs out of 5388...
Starting target val: Positive emotion
Processed 1000 docs out of 2978...
Processed 2000 docs out of 2978...
Starting target val: Negative emotion
Starting target val: I can't tell
Done!


In [406]:
# remove pared down and event stopwords as well as most punctuation, 
# lemmatize, and use '-pron-' placeholder
med_processing = tokenized_corpus_dict(df, all_emotions, 
            stop_list=custom_stopwords + punc_custom + event_stopwords, 
            lemmatize=True, orig_pronouns=False)

Starting target val: No emotion toward brand or product
Processed 1000 docs out of 5388...
Processed 2000 docs out of 5388...
Processed 3000 docs out of 5388...
Processed 4000 docs out of 5388...
Processed 5000 docs out of 5388...
Starting target val: Positive emotion
Processed 1000 docs out of 2978...
Processed 2000 docs out of 2978...
Starting target val: Negative emotion
Starting target val: I can't tell
Done!


In [407]:
# remove full nltk stopword list, event and product stopwords and all 
# punctuation, lemmatize, and use SpaCy's pronoun placeholder instead of
# the original text
max_processing = tokenized_corpus_dict(df, all_emotions, 
    stop_list=nltk_stopwords + punc + event_stopwords + product_stopwords, 
    lemmatize=True, orig_pronouns=True)

Starting target val: No emotion toward brand or product
Processed 1000 docs out of 5388...
Processed 2000 docs out of 5388...
Processed 3000 docs out of 5388...
Processed 4000 docs out of 5388...
Processed 5000 docs out of 5388...
Starting target val: Positive emotion
Processed 1000 docs out of 2978...
Processed 2000 docs out of 2978...
Starting target val: Negative emotion
Starting target val: I can't tell
Done!


In [408]:
# check out most common words from positive, medium-processed corpus
pos_med = FreqDist(med_processing['Positive emotion'])

freq_df = pd.DataFrame(pos_med.most_common(50),columns=['Word','Count'])
freq_df

Unnamed: 0,Word,Count
0,!,1250
1,ipad,965
2,apple,756
3,google,663
4,i,633
5,store,558
6,up,464
7,iphone,463
8,app,442
9,have,392


In [409]:
# check out most common words from positive, max-processed corpus
pos_max = FreqDist(max_processing['Positive emotion'])

freq_df = pd.DataFrame(pos_max.most_common(50),columns=['Word','Count'])
freq_df

Unnamed: 0,Word,Count
0,store,558
1,app,442
2,new,360
3,get,278
4,'s,255
5,pop,217
6,go,199
7,launch,190
8,open,170
9,one,151


In [410]:
# check out most common words from positive, max-processed corpus
neg_max = FreqDist(max_processing['Negative emotion'])

freq_df = pd.DataFrame(neg_max.most_common(50),columns=['Word','Count'])
freq_df

Unnamed: 0,Word,Count
0,app,85
1,store,47
2,new,45
3,like,43
4,get,42
5,'s,38
6,need,35
7,go,33
8,launch,31
9,design,30


In [None]:
fig, ax = plt.subplots(figsize=(10, 8))


# MODEL

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

## Preprocessing for modeling

A few of the options I'd like to try:

Different versions of stopwords and punctuation removal. All stopwords from the default NLTK list, which I think may contain some words that will be useful, and then also my customized, pared down stopwords list.

I'd like to keep contractions by default, but experiment with removing other punctuation. I'd like to try removing everything, and also keeping punctuation marks and question marks.

Try stemming or lemmatization. Ideally I'd like a way to expand contractions, but I'm not sure how feasible that will be.

I want to try a few different ways to vectorize. Regular `CountVectorizer` with the actual counts, TFIdF normalized, and als maybe a binary count. The [sklearn documentation mentions](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
>"...very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable."

Also, probably unigrams or bigrams.

In [594]:
df['emotion'].value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: emotion, dtype: int64

In [598]:
X = df.loc[df['emotion'].isin(['Positive emotion', 'Negative emotion']), 
           'cleaned']
y = df.loc[df['emotion'].isin(['Positive emotion', 'Negative emotion']), 
           'emotion']

In [600]:
y = y.map(lambda x: 1 if x=="Positive emotion" else 0)
y.value_counts()

1    2978
0     570
Name: emotion, dtype: int64

In [604]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

2838
2838
710
710


In [607]:
y_train.value_counts(normalize=True)

1    0.840733
0    0.159267
Name: emotion, dtype: float64

In [608]:
y_test.value_counts(normalize=True)

1    0.833803
0    0.166197
Name: emotion, dtype: float64

I'm not going to worry about stop words removal, since I plan to use TfIdf in my document matrix.

In [611]:
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', 
                             ngram_range=(1, 2))

vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)
X_train_df = pd.DataFrame(X_train_tfidf.todense(), 
                          columns=vectorizer.get_feature_names())
X_train_df

Unnamed: 0,000,000 downloads,000 louis,000 sq,000 square,000 to,000 very,02,02 symbian,03,...,zombies what,zomg,zomg got,zomg its,zomg special,zoom,zoom in,zoom to,zzzs,zzzs iphone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2835,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# iNTERPRET

Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

# CONCLUSIONS & RECOMMENDATIONS

Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

One big challenge with this dataset is how applicable it would be to other products, and also to different situations. I believe all of these tweets were tagged to the SWSW festival in Austin, so the content is specific to the activities and such that were there.

It's also pretty specific to Apple and Google products.