# Section 43: Foundations of Natural Language Processing

## Learning Objectives

- Discuss word embeddings and their advantages
- Train Word2Vec models
- Use pretrained word embeddings


- Create a classification model for true-Trump ("Twitter for Android") vs. Trump-staffer ("Twitter for iPhone") tweets, using only the period when the Android device was still in use

    - Use the lesson's `W2vVectorizer` class in scikit-learn models
    - Use LSTMs
    - Use RNN/GRUs






- Compare:
    1.  Mean embeddings vs. count/TF-IDF data with scikit-learn.
    

## NLP & Word Vectorization

> **_Natural Language Processing_**, or **_NLP_**, is the study of how computers can interact with humans through the use of human language.  Although this is a field that is quite important to Data Scientists, it does not belong to Data Science alone.  NLP has been around for quite a while, and sits at the intersection of *Computer Science*, *Artificial Intelligence*, *Linguistics*, and *Information Theory*. 

## Feature Engineering for Text Data


* Do we remove stop words or not?    
* Do we stem or lemmatize our text data, or leave the words as is?   
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?   
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?  
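
To make the last question more concrete, here is a minimal pure-Python sketch contrasting Boolean and count vectorization. The two-document corpus and the helper names (`count_vector`, `boolean_vector`) are made up for illustration and are not part of the lesson; in practice scikit-learn's `CountVectorizer` would do this work.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["great great rally", "great state"]
tokenized = [doc.split() for doc in docs]

# Fixed, sorted vocabulary so every document maps to the same columns
vocab = sorted({word for tokens in tokenized for word in tokens})

def count_vector(tokens):
    """Count vectorization: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def boolean_vector(tokens):
    """Boolean vectorization: 1 if the word occurs at all, else 0."""
    present = set(tokens)
    return [int(word in present) for word in vocab]

count_vectors = [count_vector(tokens) for tokens in tokenized]
boolean_vectors = [boolean_vector(tokens) for tokens in tokenized]

print(vocab)
print(count_vectors)
print(boolean_vectors)
```

The only difference between the two is whether repeated words are counted, which is why "great great rally" and "great rally" would look identical under Boolean vectorization.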


In [1]:
!pip install -U fsds_100719
from fsds_100719.imports import *

fsds_1007219  v0.7.6 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


| Handle | Package | Description |
|---|---|---|
| dp | IPython.display | Display modules with helpful display and clearing commands. |
| fs | fsds_100719 | Custom data science bootcamp student package |
| mpl | matplotlib | Matplotlib's base OOP module with formatting artists |
| plt | matplotlib.pyplot | Matplotlib's matlab-like plotting module |
| np | numpy | scientific computing with Python |
| pd | pandas | High performance data structures and tools |
| sns | seaborn | High-level data visualization library based on matplotlib |


['[i] Pandas .iplot() method activated.']


In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/jirvingphd/capstone-project-using-trumps-tweets-to-predict-stock-market/master/data/trump_tweets_12012016_to_01012020.csv')
df['datetime'] = pd.to_datetime(df['created_at'])
df = df.set_index('datetime').sort_index()
df

| datetime | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str |
|---|---|---|---|---|---|---|---|
| 2016-12-01 14:37:57 | Twitter for iPhone | My thoughts and prayers are with those affecte... | 12-01-2016 14:37:57 | 12077 | 65724 | False | 804333718999539712 |
| 2016-12-01 14:38:09 | Twitter for Android | Getting ready to leave for the Great State of ... | 12-01-2016 14:38:09 | 9834 | 57249 | False | 804333771021570048 |
| 2016-12-01 22:52:10 | Twitter for iPhone | Heading to U.S. Bank Arena in Cincinnati Ohio ... | 12-01-2016 22:52:10 | 5564 | 31256 | False | 804458095569158144 |
| 2016-12-02 02:45:18 | Twitter for iPhone | Thank you Ohio! Together we made history – and... | 12-02-2016 02:45:18 | 17283 | 72196 | False | 804516764562374656 |
| 2016-12-03 00:44:20 | Twitter for Android | The President of Taiwan CALLED ME today to wis... | 12-03-2016 00:44:20 | 24700 | 111106 | False | 804848711599882240 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-01-01 01:17:43 | Twitter for iPhone | RT @SenJohnKennedy: I think Speaker Pelosi is ... | 01-01-2020 01:17:43 | 8893 | 0 | True | 1212181071988703232 |
| 2020-01-01 01:18:47 | Twitter for iPhone | RT @DanScavino: https://t.co/CJRPySkF1Z | 01-01-2020 01:18:47 | 10796 | 0 | True | 1212181341078458369 |
| 2020-01-01 01:22:28 | Twitter for iPhone | Our fantastic First Lady! https://t.co/6iswto4WDI | 01-01-2020 01:22:28 | 27567 | 132633 | False | 1212182267113680896 |
| 2020-01-01 01:30:35 | Twitter for iPhone | HAPPY NEW YEAR! | 01-01-2020 01:30:35 | 85409 | 576045 | False | 1212184310389850119 |


### Making iPhone vs Android df

In [3]:
devices = ['Twitter for Android','Twitter for iPhone']

In [4]:
print(f'The first and last timestamps for the group {devices[0]} are:')
start,end = df.groupby('source').get_group(devices[0]).index[[0,-1]]
print(start)
print(end)

The first and last timestamps for the group Twitter for Android are:
2016-12-01 14:38:09
2017-03-25 14:41:14

In [5]:
## Putting it all together
df_data = df[df['source'].isin(devices)].loc[start:end]
df_data['source'].value_counts(normalize=True)

Twitter for Android    0.603648
Twitter for iPhone     0.396352
Name: source, dtype: float64

In [6]:
# Save the filtered Android-vs-iPhone subset built above (not the full df)
df_data.to_csv('datasets/trump_tweets_iphone_vs_twitter.csv',index=False)


# Word Embeddings

- Train word embeddings on minimally processed text.

In [7]:
from nltk import word_tokenize
data = df['text'].map(word_tokenize)
data_lower = list(map(lambda x: [w.lower() for w in x],data))

In [8]:
# data has a datetime index, so use .iloc for positional access
data.iloc[2],data_lower[2]

(['Heading',
  'to',
  'U.S.',
  'Bank',
  'Arena',
  'in',
  'Cincinnati',
  'Ohio',
  'for',
  'a',
  '7pm',
  'rally',
  '.',
  'Join',
  'me',
  '!',
  'Tickets',
  ':',
  'https',
  ':',
  '//t.co/HiWqZvHv6M'],
 ['heading',
  'to',
  'u.s.',
  'bank',
  'arena',
  'in',
  'cincinnati',
  'ohio',
  'for',
  'a',
  '7pm',
  'rally',
  '.',
  'join',
  'me',
  '!',
  'tickets',
  ':',
  'https',
  ':',
  '//t.co/hiwqzvhv6m'])

- Train Word2Vec on minimally processed text.
    - Try only removing uppercase?

## Training Word2Vec

In [9]:
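from gensim.models import Word2Vec
TEXT_SOURCE = data_lower
VECTOR_SIZE = 100
# NOTE: `size` is the gensim 3.x keyword; gensim 4+ renames it to `vector_size`
model = Word2Vec(TEXT_SOURCE,size=VECTOR_SIZE,window=4,min_count=1,workers=4)
# Passing the corpus to the constructor already trains the model; this call trains for 10 more epochs
model.train(TEXT_SOURCE, total_examples=model.corpus_count,epochs=10)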

(3327046, 4413210)

In [10]:
wv = model.wv

wv.most_similar('clinton')

[('hillary', 0.8966301679611206),
 ('crooked', 0.8911771178245544),
 ('h', 0.8010057210922241),
 ('fbi', 0.7868572473526001),
 ('dnc', 0.7865959405899048),
 ('emails', 0.7657970190048218),
 ('cl…', 0.7605175375938416),
 ('strzok', 0.759728193283081),
 ('deleted', 0.7320792078971863),
 ('comey', 0.7287471294403076)]

In [11]:
wv.most_similar(positive=['america'])

[('uncomfortable', 0.6394579410552979),
 ('usa', 0.6315405964851379),
 ('future', 0.6265136003494263),
 ('arrangements', 0.6233097314834595),
 ('//t.co/5volztaorm', 0.6066502332687378),
 ('morphing', 0.5912973880767822),
 ('american', 0.5843404531478882),
 ('america…', 0.5766141414642334),
 ('nation', 0.574530839920044),
 ('//t.co/5jxdojpzmn', 0.5607695579528809)]

### Math with Trump's Vectors

In [13]:
# ### USING WORD VECTOR MATH TO GET A FEEL FOR QUALITY OF MODEL
# def get_vector(string):#,wv=wv):
#     return wv.get_vector(string)


# def get_similar(vector):#,wv=wv):
#     return wv.similar_by_vector(vector)


# def check_vocab(word):
#     return word in wv.vocab

def word_math(wv,pos_words=['hillary'],neg_words=['bill'],
              verbose=True,return_vec=False):
    """Print (and optionally return) the words most similar to pos_words minus neg_words."""
    # Accept a single word as well as a list of words
    if isinstance(pos_words,str):
        pos_words=[pos_words]
    if isinstance(neg_words,str):
        neg_words=[neg_words]

    # Build a readable version of the equation for the header
    pos_eqn = '+'.join(pos_words)
    neg_eqn = '-'.join(neg_words)

    print('---'*15)
    print(f"[i] Result for:\t{pos_eqn}{' - '+neg_eqn if len(neg_eqn)>0 else ' '}")
    print('---'*15)

    answer = wv.most_similar(positive=pos_words,negative=neg_words)

    if verbose:
        for word, score in answer:
            print(f"- {word} ({round(score,3)})")
        print('---'*15,'\n\n')

    if return_vec:
        return answer

    

In [15]:
# Each entry is a (positive_words, negative_words) pair
equation_list=[(['america','crime'],[]),
               (['democrats','russia'],[]),
               (['republican'],['honor']),
               (['man','power'],[]),
               (['russia','honor'],[]),
               (['china','tariff'],[])]

for eqn in equation_list:
    word_math(wv,*eqn)

---------------------------------------------
[i] Result for:	america+crime 
---------------------------------------------
- borders (0.705)
- morally (0.702)
- choice (0.678)
- r (0.636)
- strong (0.63)
- usa (0.615)
- future (0.601)
- tough (0.598)
- california (0.596)
- cuts (0.595)
--------------------------------------------- 


---------------------------------------------
[i] Result for:	democrats+russia 
---------------------------------------------
- dems (0.787)
- russians (0.755)
- corruption (0.705)
- russian (0.702)
- collusion (0.655)
- dnc (0.655)
- facts (0.651)
- hoax (0.64)
- they (0.636)
- answers (0.629)
--------------------------------------------- 


---------------------------------------------
[i] Result for:	republican - honor
---------------------------------------------
- democrat (0.679)
- thedemocrats (0.566)
- dem (0.506)
- republicans (0.5)
- party (0.497)
- dems (0.478)
- votes (0.478)
- liberal (0.475)
- democratic (0.471)
- //t.co/neavcugpzz (0.456)
--
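
The results above come from vector arithmetic followed by a similarity ranking. As a rough sketch of that idea (not gensim's actual implementation, which may differ in detail), the snippet below averages unit-length vectors, adding the positive words and subtracting the negative ones, then ranks the remaining words by cosine similarity. The 2-D vectors and the names `toy_vectors` and `toy_most_similar` are invented for illustration.

```python
import math

# Toy 2-D "embeddings" (made-up values, for illustration only)
toy_vectors = {
    "tax": [1.0, 0.0],
    "cuts": [0.9, 0.1],
    "border": [0.0, 1.0],
    "wall": [0.1, 0.9],
    "economy": [0.7, 0.7],
}

def unit(v):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x / len(signed)
    # The input words themselves are left out of the ranking
    inputs = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

addition = toy_most_similar(toy_vectors, positive=["tax", "border"])
subtraction = toy_most_similar(toy_vectors, positive=["economy"], negative=["border"])
print(addition[0][0])
print(subtraction[0][0])
```

With real embeddings the geometry is far less tidy, which is why results such as `america+crime` are better read as a rough sanity check of the model than as precise analogies.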

## Embedding Layers
Keep a few caveats in mind when using embedding layers in a neural network:

* The embedding layer must always be the first layer of the network, meaning that it should immediately follow the `Input()` layer 
* All words in the text should be integer-encoded, with each unique word encoded as its own unique integer  
* The first parameter of the embedding layer (its input dimension) must be greater than the largest integer index in the data. In practice that means at least the vocabulary size plus one when index 0 is reserved for padding. The second parameter sets the size of the word vectors themselves
* All sequences passed to the layer must have the same length, so the data is converted to padded sequences of one fixed size during the preprocessing step 


[Keras Documentation for Embedding Layers](https://keras.io/layers/embeddings/).
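
The caveats above can be illustrated without Keras. The following is a minimal pure-Python sketch, under the assumption that an embedding layer behaves like a lookup table with one row per integer index; the toy texts, `MAX_LEN`, and the randomly filled `embedding_matrix` are illustrative stand-ins, not trained values.

```python
import random

# Toy corpus of tokenized texts (made up for illustration)
texts = [["make", "america", "great"], ["great", "rally"]]

# 1. Integer-encode: each unique word gets its own integer; 0 is reserved for padding
word_index = {}
for tokens in texts:
    for word in tokens:
        if word not in word_index:
            word_index[word] = len(word_index) + 1

sequences = [[word_index[w] for w in tokens] for tokens in texts]

# 2. Pad every sequence to the same length (zeros in front, as Keras does by default)
MAX_LEN = 4
padded = [[0] * (MAX_LEN - len(seq)) + seq for seq in sequences]

# 3. The lookup table needs one row per integer index, so it has
#    vocabulary size + 1 rows to cover the padding index 0
VOCAB_ROWS = len(word_index) + 1
EMBEDDING_SIZE = 3
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_SIZE)]
                    for _ in range(VOCAB_ROWS)]

# Looking up a padded sequence yields MAX_LEN vectors of size EMBEDDING_SIZE
embedded = [[embedding_matrix[idx] for idx in seq] for seq in padded]

print(padded)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

In a real network the rows of the table are learned during training (or initialized from pretrained vectors) rather than drawn at random.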

### Creating Mean Embeddings

In [None]:
from sklearn.model_selection import train_test_split
from nltk import word_tokenize
data = df['text'].map(word_tokenize)
X = list(map(lambda x: [w.lower() for w in x],data))
y = df['source']

# Split row positions once so every feature set shares the same train/test rows
X_idx = list(range(len(X)))
train_idx,test_idx = train_test_split(X_idx,random_state=123)


def train_test_split_idx(X, y, train_idx,test_idx):
    """Slice X and y (NumPy arrays or SciPy sparse matrices) by precomputed row positions."""
    X_train = X[train_idx].copy()
    y_train = y[train_idx].copy()
    X_test = X[test_idx].copy()
    y_test = y[test_idx].copy()
    return X_train, X_test,y_train, y_test

# NOTE: X_tf and X_tfidf are created in the count/TF-IDF cell further down; run that cell first
X_dict = {'count':X_tf,
         'tfidf':X_tfidf}

In [33]:
class W2vVectorizer(object):
    """Turn each tokenized document into the mean of its word vectors."""
    
    def __init__(self, w2v):
        # Takes in a dictionary of words and vectors as input
        self.w2v = w2v
        if len(w2v) == 0:
            self.dimensions = 0
        else:
            # Infer the vector size from any one entry of the dictionary
            self.dimensions = len(w2v[next(iter(w2v))])
    
    # Note: Even though it doesn't do anything, it's required that this object implement a fit method or else
    # it can't be used in a scikit-learn pipeline  
    def fit(self, X, y=None):
        return self
            
    def transform(self, X):
        # Average the vectors of the known words; fall back to a zero vector if none are known
        return np.array([
            np.mean([self.w2v[w] for w in words if w in self.w2v]
                   or [np.zeros(self.dimensions)], axis=0) for words in X])
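
To see what the `transform` step computes, here is a small pure-Python sketch of a mean embedding. The 2-D vectors in `toy_w2v` and the helper `mean_embedding` are invented for illustration; the class above does the same thing with NumPy over a real word-vector dictionary.

```python
# Toy 2-D word vectors (made-up values, for illustration only)
toy_w2v = {"great": [1.0, 0.0], "rally": [0.0, 1.0], "ohio": [1.0, 1.0]}
DIMENSIONS = 2

def mean_embedding(tokens, w2v, dimensions):
    """Average the vectors of the tokens found in w2v; all zeros if none are found."""
    known = [w2v[t] for t in tokens if t in w2v]
    if not known:
        return [0.0] * dimensions
    return [sum(column) / len(known) for column in zip(*known)]

# "tonight" is not in the dictionary, so only "great" and "rally" are averaged
doc_vector = mean_embedding(["great", "rally", "tonight"], toy_w2v, DIMENSIONS)
unknown_vector = mean_embedding(["unknown"], toy_w2v, DIMENSIONS)

print(doc_vector)
print(unknown_vector)
```

Every document ends up as one fixed-length vector regardless of how many words it contains, which is what lets scikit-learn classifiers consume it. The trade-off is that word order is discarded.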

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf =  Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
              ('Random Forest', RandomForestClassifier(n_estimators=100, verbose=True))])
svc = Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
                ('Support Vector Machine', SVC())])
lr = Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
              ('Logistic Regression', LogisticRegression())])

models = [('Random Forest', rf),
          ('Support Vector Machine', svc),
          ('Logistic Regression', lr)]
# models = {'Random Forest':RandomForestClassifier(n_estimators=100, verbose=True),
#           'SVC':SVC(),'lr':LogisticRegression()}

In [None]:
# Score each pipeline on the lowercased tokens, using the tweet source as the target
scores = [(name, cross_val_score(model, data_lower, df['source'], cv=2).mean()) for name, model in models]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,TfidfVectorizer
count_vectorizer = CountVectorizer()
tf_transformer = TfidfTransformer(use_idf=False)  # normalized term frequencies only
tfidf_transformer = TfidfTransformer(use_idf=True)  # term frequencies weighted by IDF

In [None]:
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
y = le.fit_transform(df['source'])
y.shape

In [None]:
X = count_vectorizer.fit_transform(df['text'])
X_tf = tf_transformer.fit_transform(X)
X_tfidf = tfidf_transformer.fit_transform(X)

In [None]:
X_tf.shape,X_tfidf.shape

In [None]:
res = [['Method','Model',"Result"]]

# The count/TF-IDF matrices are already vectorized, so use plain classifiers
# here (plain_models) rather than the W2vVectorizer pipelines
plain_models = {'Random Forest':RandomForestClassifier(n_estimators=100, verbose=True),
                'SVC':SVC(),'lr':LogisticRegression()}

for tf_type,X_data in X_dict.items():
    X_train, X_test,y_train, y_test = train_test_split_idx(
        X_data,y,train_idx,test_idx)
    
    for name, model in plain_models.items():
        cv_res = cross_val_score(model, X_train,y_train, cv=5)
        res.append([tf_type,name,cv_res.mean()])

pd.DataFrame(res[1:],columns=res[0]).sort_values("Result",ascending=False)

In [None]:


rf_params = dict(n_estimators=100, verbose=True)

rf =Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
              ('Random Forest',RandomForestClassifier(**rf_params))])

svc = Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
                ('Support Vector Machine', SVC())])

lr = Pipeline([('Word2Vec Vectorizer', W2vVectorizer(glove)),
              ('Logistic Regression', LogisticRegression())])

In [None]:
models = [('Random Forest', rf),
          ('Support Vector Machine', svc),
          ('Logistic Regression', lr)]

In [None]:
# res = [['Model','Score']]
res=[['Model','Scores']]
for (name, model) in models:
    print(name)
    cv_res = cross_val_score(model, data_lower, df['source'], cv=5).mean()
    res.append([name,cv_res])
    
pd.DataFrame(res[1:],columns=res[0])

## Comparing Trained vs Pre-Trained Models

In [28]:
import os
# Local folder holding the downloaded GloVe files; change this to match your machine
folder = '/Users/jamesirving/Datasets/'
glove_file = os.path.join(folder,'glove.6B','glove.6B.50d.txt')
glove_twitter_file = os.path.join(folder,'glove.twitter.27B','glove.twitter.27B.100d.txt')


### Keeping only the vectors needed

In [18]:
## This line of code for getting all words bugs me
total_vocabulary = set(word for tweet in data_lower for word in tweet)
len(total_vocabulary)

24051

In [29]:
glove = {}
with open(glove_twitter_file,'rb') as f:
    for line in f:
        # Each line is a word followed by the components of its vector
        parts = line.split()
        word = parts[0].decode('utf-8')
        # Keep only the vectors for words that occur in our tweets
        if word in total_vocabulary:
            vector = np.array(parts[1:], dtype=np.float32)
            glove[word] = vector

In [None]:
# df['source'].value_counts()

In [None]:
# data = df['text'].map(word_tokenize)
# data_lower = list(map(lambda x: [w.lower() for w in x],data))

### Now With Neural Networks

In [None]:
from py_files import keras_gridsearch as kg
from sklearn import metrics

In [None]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, Embedding
from keras.layers import Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence
from keras.utils import to_categorical

In [None]:
y = pd.get_dummies(df['source'])
y_t=y.values

In [None]:
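# Length of the longest tokenized tweet (`sequences` is created in the tokenizer cell below; run that cell first)
max(len(seq) for seq in sequences)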

In [None]:
MAX_WORDS = 20000
tokenizer = text.Tokenizer(num_words=MAX_WORDS)

# data_lower is already tokenized, so each tweet is passed in as a list of words
tokenizer.fit_on_texts(data_lower)
sequences = tokenizer.texts_to_sequences(data_lower)

# Pad (or truncate) every sequence to 100 integers
X_t = sequence.pad_sequences(sequences, maxlen=100)
X_t.shape

In [None]:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X_t,y, test_size=.15)
# print(X_train.shape),print(y_test.shape)
X_train, X_test,y_train,y_test =train_test_split_idx(X_t,y_t,train_idx,test_idx)
X_train.shape,y_train.shape


In [None]:
EMBEDDING_SIZE = 128  # length of each learned word vector (a tunable hyperparameter)
model=Sequential()
model.add(Embedding(MAX_WORDS,EMBEDDING_SIZE))
model.add(LSTM(25,return_sequences=True))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

# Two softmax outputs with one-hot targets call for categorical cross-entropy
model.compile(loss='categorical_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# predict_classes was removed from newer Keras releases; argmax over the softmax outputs gives the same labels
y_hat_test = model.predict(X_test).argmax(axis=1)
kg.evaluate_model(y_test,y_hat_test,history)

In [None]:
# metrics.plot_confusion_matrix(model_sk,X_test,y_test.values.argmax(axis=1))

# Practicing Text Preprocessing with Trump's Tweets

# LAST CLASS

### Removing Stopwords

In [None]:
## Make a list of stopwords to remove
from nltk.corpus import stopwords
import string

In [None]:
# Get all the stop words in the English language
stopwords_list = stopwords.words('english')
stopwords_list+=string.punctuation
print(stopwords_list)
stopwords_list.remove('until')
stopwords_list.extend(['“','...','”'])

In [None]:
## Commentary on not always accepting what is or isn't in stopwords
'until' in stopwords_list

In [None]:
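from nltk import FreqDist

# `tokens` is assumed to be a flat list of every token in the corpus (it is not defined in the cells shown)
stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
freq = FreqDist(stopped_tokens)
freq.most_common(100)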

In [None]:
from nltk import word_tokenize
from ipywidgets import interact

# `corpus` is assumed to be the list of raw tweet texts, e.g. df['text'].to_list()
# (it is not defined in the cells shown)

@interact
def tokenize_tweet(i=(0,len(corpus)-1)):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize
    
    print(f"- Tweet #{i}:\n")
    print(corpus[i],'\n')
    tokens = word_tokenize(corpus[i])

    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += string.punctuation
    # Lowercase before the membership test so capitalized stopwords ("The") are removed too
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    print(tokens,end='\n\n')
    print(stopped_tokens)

In [None]:
## Get FreqDist for Cleaned Text Data
corpus[:20]

### Comparing Phases of Preprocessing/Tokenization

In [None]:
# def clean_text(text,exclude_words=['until']):
#     from nltk.corpus import stopwords
#     import string
#     from nltk import word_tokenize,regexp_tokenize
#     ## tokenize text
#     tokens = word_tokenize(text)
#     # Get all the stop words in the English language
#     stopwords_list = stopwords.words('english')
#     stopwords_list += string.punctuation
#     stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]
#     return stopped_tokens


## Regular Expressions

- Best regexp resource and tester: https://regex101.com/

    - Make sure to check "Python" under Flavor menu on left side.
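
As a quick offline alternative to the tester, the snippet below runs the tokenizing pattern used in this lesson through Python's built-in `re` module. The sample sentence is made up for illustration; the pattern matches runs of letters, optionally followed by an apostrophe and more lowercase letters.

```python
import re

# The lesson's tokenizing pattern: letters, optionally followed by 's, 't, etc.
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

# Made-up sample text containing a contraction, a number, and a URL
sample = "Don't miss tonight's 7pm rally in Ohio! https://t.co/abc"

tokens = re.findall(pattern, sample)
print(tokens)
```

Contractions such as "Don't" survive as single tokens, while digits and punctuation are dropped, so "7pm" becomes "pm" and the URL is split into fragments. That side effect is one reason to strip URLs before tokenizing.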

In [None]:
text =  corpus[6615]
text

In [None]:
text2=corpus[7347]
text2

In [None]:
from nltk import regexp_tokenize
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
regexp_tokenize(text,pattern)

In [None]:
print('[i] Word Tokenize:',end='\n'+'---'*20+'\n')
print(word_tokenize(text))

print('\n[i] Regexp Tokenize:',end='\n'+'---'*20+'\n')
print(regexp_tokenize(text,pattern))

In [None]:
def clean_text(text,regex=True):
    """Tokenize text, then lowercase it and drop stopwords and punctuation."""
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize

    ## tokenize text
    if regex:
        pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
        tokens= regexp_tokenize(text,pattern)
    else:
        tokens = word_tokenize(text)
    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += string.punctuation
    # Lowercase before the membership test so capitalized stopwords ("The") are removed too
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    return stopped_tokens

In [None]:
# @interact
# def regexp_tokenize_tweet(i=(0,len(corpus)-1)):
#     print(f"- Tweet #{i}:\n")
#     print(corpus[i],'\n')
#     from nltk import regexp_tokenize
#     pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
#     tokens= regexp_tokenize(corpus[i],pattern)

#     # It is usually a good idea to lowercase all tokens during this step, as well
#     stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]
#     print(tokens,end='\n\n')
#     return print(stopped_tokens)

In [None]:
import re

def find_urls(string): 
    return re.findall(r"(http[s]?://\w*\.\w*/+\w+)",string)

def find_hashtags(string):
    return re.findall(r'\#\w*',string)

def find_retweets(string):
    return re.findall(r'RT [@]?\w*:',string)

def find_mentions(string):
    return re.findall(r'\@\w*',string)

In [None]:
find_urls(text)

In [None]:
find_mentions(text2)

### Stemming/Lemmatization

In [None]:

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

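print(lemmatizer.lemmatize('feet')) # foot
# lemmatize() treats every word as a noun unless told otherwise, so 'running' comes back unchanged
print(lemmatizer.lemmatize('running')) # running
print(lemmatizer.lemmatize('running', pos='v')) # run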

In [None]:
text_in =  corpus[6615]

# # urls = find_urls(text)
# def clean_text(text,regex=True):
#     from nltk.corpus import stopwords
#     import string
#     from nltk import word_tokenize,regexp_tokenize

#     ## tokenize text
#     if regex:
#         pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
#         tokens= regexp_tokenize(text,pattern)
#     else:
#         tokens = word_tokenize(text)
#     # Get all the stop words in the English language
#     stopwords_list = stopwords.words('english')
#     stopwords_list += string.punctuation
#     stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]
#     return stopped_tokens

def process_tweet(text,as_lemmas=False,as_tokens=True):
    # Strip URLs, retweet prefixes, and hashtags from the raw text
    for x in find_urls(text):
        text = text.replace(x,'')
        
    for x in find_retweets(text):
        text = text.replace(x,'')    
        
    for x in find_hashtags(text):
        text = text.replace(x,'')    

    if as_tokens:
        text = clean_text(text)
    
    if as_lemmas:
        # WordNetLemmatizer expects one word at a time, so lemmatize word by word
        from nltk.stem.wordnet import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        if as_tokens:
            text = [lemmatizer.lemmatize(w) for w in text]
        else:
            text = ' '.join(lemmatizer.lemmatize(w) for w in text.split())
    
    if len(text)==0:
        text=''
            
    return text

In [None]:
@interact
def show_processed_text(i=(0,len(corpus)-1)):
    text_in = corpus[i]#.copy()
    print(text_in)
    text_out = process_tweet(text_in)
    print(text_out)
    text_out2 = process_tweet(text_in,as_lemmas=True)
    print(text_out2)

In [None]:
corpus[:6]

## Text Classification

> Potential Tasks: Classify Android vs iPhone tweets (from the period when Android tweets still exist)

In [None]:
df['datetime'] = pd.to_datetime(df['created_at'])
df

df = df.set_index('datetime').sort_index()
df

In [None]:
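# process_tweet returns a list of tokens; join them back into one string so TfidfVectorizer can use it
df['clean_text'] = df['text'].apply(lambda tweet: ' '.join(process_tweet(tweet)))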
df

In [None]:
android = df.groupby('source').get_group('Twitter for Android')
android.index

iphone = df.groupby('source').get_group('Twitter for iPhone').loc[:android.index[-1]]
iphone

In [None]:
len(android), len(iphone)

In [None]:
df_corpus = pd.concat([iphone,android],axis=0)
df_corpus['source'].value_counts()

### Vectorization 

- Count vectorization
- Term Frequency-Inverse Document Frequency (TF-IDF)
    -  Used for multiple texts
    
    
**_Term Frequency_** is calculated with the following formula:

$$ \text{Term Frequency}(t) = \frac{\text{number of times } t \text{ appears in a document}} {\text{total number of terms in the document}} $$ 

**_Inverse Document Frequency_** is calculated with the following formula:

$$ \text{IDF}(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

The **_TF-IDF_** value for a given word in a given document is found by multiplying the two!
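
As a worked example of the two formulas, here is a minimal pure-Python sketch on a three-document toy corpus. The corpus and function names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` applies IDF smoothing and vector normalization by default, so its numbers will not match this textbook version exactly.

```python
import math

# Toy corpus of tokenized documents (made up for illustration)
docs = [["great", "rally", "great"], ["great", "state"], ["big", "rally"]]

def term_frequency(term, doc):
    """Share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """Natural log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF is the product of the two quantities above."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# "great" appears in 2 of 3 documents, "state" in only 1
common_score = tf_idf("great", docs[1], docs)
rare_score = tf_idf("state", docs[1], docs)
print(round(common_score, 4), round(rare_score, 4))
```

Within the second document both words occur once, yet "state" scores higher because it appears in fewer documents overall. That is exactly the effect the IDF term is meant to produce.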


## Questions/Topics 
- Next time: vectorization
- Vs Embeddings

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [None]:
vectorizer.fit_transform(df_corpus['clean_text'].values)