# NLTK & tweepy



This tutorial is partly based on [this Medium article on Twitter sentiment analysis](https://medium.com/analytics-vidhya/twitter-sentiment-analysis-134553698978).

## Setup

In [24]:
import nltk
import pandas as pd

## Data

### Import

In [25]:
df = pd.read_csv("tweets.csv")

## Tokenization

The first step is to remove noise such as punctuation, hashtag symbols, @ signs, and every other character that is not alphanumeric. For this analysis, we treat only alphanumeric tokens as useful for identifying sentiment. To remove the noise, we import RegexpTokenizer, which splits a string into substrings based on a regular expression. The pattern \w+ keeps only runs of word characters (letters, digits, and underscores), so everything else is dropped.

In [26]:
from nltk.tokenize import RegexpTokenizer

regexp = RegexpTokenizer(r'\w+')

df['text_token']=df['text'].apply(regexp.tokenize)


In [27]:
df['text_token']

0    [Wir, unterstützen, die, Impfkampagne, Zusamme...
1    [Studierende, der, HdM, haben, Ulrich, Land, u...
2    [Die, Hochschulen, der, Region, Stuttgart, lad...
3    [Seit, Oktober, 2021, ist, Prof, Dr, Bernd, Sc...
Name: text_token, dtype: object
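To see what the \w+ pattern keeps and drops, here is a minimal sketch on a single made-up tweet. It uses re.findall from the standard library, which should produce the same tokens as RegexpTokenizer(r'\w+') with its default settings. The sample text and URL are hypothetical and not taken from tweets.csv.

```python
import re

# Hypothetical sample tweet (an assumption, not a row from tweets.csv)
sample = "Wir unterstützen die #Impfkampagne von @hdm_stuttgart! Mehr: https://example.org/info_2021"

# Same pattern as the tokenizer above: \w matches Unicode letters, digits and "_"
tokens = re.findall(r"\w+", sample)

print(tokens)
# ['Wir', 'unterstützen', 'die', 'Impfkampagne', 'von', 'hdm_stuttgart',
#  'Mehr', 'https', 'example', 'org', 'info_2021']
```

Note that the # and @ symbols disappear while the hashtag and user name themselves remain, and that URLs are broken into fragments such as https and org. If those fragments are unwanted, one option is to strip URLs from the text before tokenizing.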

### Stopwords

Now that the tweets are tokenized, the next step is to remove common words that are not useful for sentiment analysis. Words such as articles, prepositions, and conjunctions (for example ‘about’ or ‘above’) occur frequently in any text but carry little information for our purpose. These words are called stopwords. We will now remove the stopwords to make our tweets cleaner for analysis.


In [28]:
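from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment on first run

# make a list of German stopwords (separate name, so the imported module is not shadowed)
german_stopwords = stopwords.words("german")

In [29]:
# remove stopwords (case-sensitive match: capitalized tokens such as "Die" are kept)
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in german_stopwords])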

In [30]:
df['text_token']

0    [Wir, unterstützen, Impfkampagne, ZusammenGege...
1    [Studierende, HdM, Ulrich, Land, Jörg, Markste...
2    [Die, Hochschulen, Region, Stuttgart, laden, M...
3    [Seit, Oktober, 2021, Prof, Dr, Bernd, Schmid,...
Name: text_token, dtype: object
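The output above still contains tokens such as Wir, Die, and Seit. A likely reason is that the entries in NLTK's German stopword list are lowercase, so capitalized tokens never match. The following sketch compares a case-sensitive with a case-insensitive filter. The small stopword set is a hand-picked stand-in (an assumption) for the full list, so the example runs without downloading any NLTK data.

```python
# Stand-in for the NLTK German stopword list (assumption: lowercase entries only)
german_stopwords = {"wir", "die", "der", "seit", "haben"}

# Hypothetical tokens resembling the first tweet
tokens = ["Wir", "unterstützen", "die", "Impfkampagne"]

# Case-sensitive filter: "Wir" survives because only "wir" is in the set
case_sensitive = [t for t in tokens if t not in german_stopwords]

# Case-insensitive filter: compare the lowercased token instead
case_insensitive = [t for t in tokens if t.lower() not in german_stopwords]

print(case_sensitive)    # ['Wir', 'unterstützen', 'Impfkampagne']
print(case_insensitive)  # ['unterstützen', 'Impfkampagne']
```

Using a set rather than a list for the stopwords also makes each membership test faster, which can matter for larger tweet collections.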

## Lowercase

Convert all tokens to lowercase.

## Remove short words

After removing the stopwords, we remove all tokens with a length of 2 or less. Such short tokens rarely carry sentiment and are most likely noise in our analysis. Apart from removing short tokens, we convert all tokens to lowercase, because words like ‘apple’ and ‘Apple’ have the same meaning for sentiment purposes.
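The two steps described here, lowercasing and dropping short tokens, can be sketched as one small helper. The function name normalize and its min_len parameter are hypothetical names introduced for illustration; the token list imitates the fourth tweet shown earlier.

```python
def normalize(tokens, min_len=3):
    """Lowercase every token and keep only those with at least min_len characters."""
    return [t.lower() for t in tokens if len(t) >= min_len]

# Hypothetical tokens resembling the fourth tweet
tokens = ["Seit", "Oktober", "2021", "Prof", "Dr", "Bernd", "Schmid"]

cleaned = normalize(tokens)
print(cleaned)
# ['seit', 'oktober', '2021', 'prof', 'bernd', 'schmid']
```

Applied to the DataFrame, the same helper could be passed to df['text_token'].apply(normalize). If stopwords are matched case-sensitively, it may be preferable to lowercase before removing them rather than after.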

## NLTK resources

- names: A list of common English names compiled by Mark Kantrowitz
- stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
- state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
- twitter_samples: A list of social media phrases posted to Twitter
- movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
- averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech
- vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
- punkt: A data model created by Jan Strunk that NLTK uses to split full texts into sentences before breaking them into word lists

In [2]:
# Download the NLTK resources listed above
nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])

[nltk_data] Downloading package names to /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data] Downloading package punkt to /U

True

Start by loading the State of the Union corpus you downloaded earlier:

In [3]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]


In [9]:
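# Note: df['text'] holds whole tweets, not single words. Any tweet that contains
# spaces or punctuation fails str.isalpha(), so this list ends up empty.
words_t = [w for w in df['text'] if w.isalpha()]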


In [11]:
words_t

[]
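The empty result above is expected: str.isalpha() is applied to each whole tweet, and a string containing spaces or punctuation is never purely alphabetic. A minimal sketch of one way to get a flat word list is to tokenize each text first and then filter the individual tokens. The two sample texts are made up and only stand in for df['text'].

```python
import re

# Hypothetical stand-in for df['text'] (not the actual tweets)
texts = ["Seit Oktober 2021 im Amt.", "Die Hochschulen laden ein!"]

# Filtering whole strings: every text contains spaces, so nothing passes
whole_texts = [t for t in texts if t.isalpha()]

# Tokenize each text first, then keep tokens made up of letters only
words_t = [w for t in texts for w in re.findall(r"\w+", t) if w.isalpha()]

print(whole_texts)  # []
print(words_t)
# ['Seit', 'Oktober', 'im', 'Amt', 'Die', 'Hochschulen', 'laden', 'ein']
```

With the DataFrame from earlier, a similar flat list could be built from the already tokenized column, for example by iterating over the rows of df['text_token'] or by calling its explode() method.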

In [4]:
words

['PRESIDENT',
 'HARRY',
 'S',
 'TRUMAN',
 'S',
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'of',
 'the',
 'Congress',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 'my',
 'friends',
 'and',
 'colleagues',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 'Only',
 'yesterday',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt',
 'At',
 'a',
 'time',
 'like',
 'this',
 'words',
 'are',
 'inadequate',
 'The',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 'Yet',
 'in',
 'this',
 'decisive',
 'hour',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 'our',
 'silence',
 'might',
 'be',
 'misunderstood',
 'and',
 'might',
 'give',
 'comfort',
 'to',
 'our',
 'enemies',
 'In',
 'His',
 'infi

Note that you build a list of individual words with the corpus’s .words() method, but you use str.isalpha() to include only the words that are made up of letters. Otherwise, your word list may end up with “words” that are only punctuation marks.