## Intro to text analysis in Python

This section gives a short overview of text analysis in Python. For an in-depth treatment of the topic, please attend the workshop on Friday. The NLTK examples are based on "Natural Language Processing with Python" (Bird, Klein and Loper, 2010). We first have to install NLTK from the Anaconda prompt and then import it.
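Before running the cells, it can help to confirm that NLTK is visible to the notebook kernel. The following is a minimal sketch, not part of the original material: `is_installed` is a hypothetical helper, and the install command in the message assumes the package was (or will be) installed with `conda install nltk` from the Anaconda prompt.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    # Assumed install command; adjust if you use pip or another environment manager.
    print("NLTK not found; run 'conda install nltk' in the Anaconda prompt first.")
```

If the second message appears, install the package and restart the kernel before continuing.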



## Working with strings

In [None]:
#Strings
monty = "Monty Python's " \
         "Flying Circus. "
print(monty*2 + " plus just the last word: " + monty[-8:])
print(monty.find('Python')) #finds position of substring within string
print(monty.upper() +' and '+ monty.lower())
print(monty.replace('y', 'x'))

In [None]:
# Joining and splitting strings to/from lists
' '.join(['Monty', 'Python'])


In [None]:
'Monty Python'.split()

In [None]:
#Regular expressions
import re
word = 'supercalifragilisticexpialidocious'
#Example: find and count all vowels.  
len(re.findall(r'[aeiou]', word))
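The same `re.findall` call can feed a frequency count when the total is not enough. The sketch below uses only the standard library; combining it with `collections.Counter` is an illustration added here, not something the original example does.

```python
import re
from collections import Counter

word = 'supercalifragilisticexpialidocious'

# findall returns every match as a list; Counter tallies how often each vowel occurs
vowel_counts = Counter(re.findall(r'[aeiou]', word))

# Print the vowels from most to least frequent
for vowel, count in vowel_counts.most_common():
    print(vowel, count)
```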

## Common cleaning tasks

In [None]:
raw="The British people are telling us to “just get on with it”. Really? Where’s your evidence for that? \
Or is it wishful thinking? Coz I’m seeing Revoke, Revoke, Revoke - largest petition in history, in street stalls... \
and as the most popular options in the polls. #Brexit \
To those who are pretending that the withdrawal agreement is Brexit, it is clear you are either mistaken, \
deluded or dishonest.\n\nWe need to build trust in British politics.\n\n The WA will do no such thing. \
Latest CER estimate: the UK economy is 2.5 per cent smaller than it would be if Britain had voted remain. \
The knock-on hit to the public finances is £19 billion per annum – or £360 million a week."

In [None]:
import nltk
#download data for the examples
nltk.download() #please mind the window that opens

In [None]:
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
#Tokenizing - divide into tokens
tokens = nltk.word_tokenize(raw)
tokens[:10]

In [None]:
#Normalizing - turn to lower case
lc_tokens=[w.lower() for w in tokens]
lc_tokens[:10]

In [None]:
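# Removing stopwords and keeping only alphanumeric tokens
nltk.download('stopwords')  # the stopword lists are a separate download
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))  # a set makes the membership test fast
content = [w for w in lc_tokens if w not in stop_words and w.isalnum()]
content[:10]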

In [None]:
# Stemming - strip off affixes (Porter stemmer)
porter = nltk.PorterStemmer()
porter_stemmed=[porter.stem(t) for t in content]
porter_stemmed[:10]

In [None]:
# Stemming - strip off affixes (Lancaster stemmer, usually more aggressive than Porter)
lancaster = nltk.LancasterStemmer()
lan_stemmed=[lancaster.stem(t) for t in content]
lan_stemmed[:10]

In [None]:
nltk.download('wordnet')
# Lemmatizing - reduce each word to its dictionary form (lemma) using WordNet
wnl = nltk.WordNetLemmatizer()
lemmatized=[wnl.lemmatize(t) for t in content]
lemmatized[:10]

In [None]:
# Sentence segmentation - split the raw text into sentences (uses the punkt models)
sents = nltk.sent_tokenize(raw)
print(sents)

In [None]:
# Operating on every element with a for loop and conditionals
for word in tokens[0:5]:
    if len(word)<=5 and word.endswith('e'):
        print(word, 'is short and ends with e')
    elif word.istitle():
        print(word, 'is a titlecase word')
    else:
        print(word, 'is just another word')
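The same per-element logic can also be written as a list comprehension that collects the results instead of printing them. This is a sketch under assumptions: `describe` is a hypothetical helper introduced here, and `sample_tokens` is a hand-written stand-in for `tokens[0:5]` so that the cell runs without NLTK.

```python
# Hand-written stand-in for tokens[0:5]; replace with the real token list
sample_tokens = ['The', 'British', 'people', 'are', 'telling']


def describe(word):
    """Apply the same three-way test as the loop and return the label."""
    if len(word) <= 5 and word.endswith('e'):
        return 'is short and ends with e'
    elif word.istitle():
        return 'is a titlecase word'
    return 'is just another word'


# One (word, label) pair per token, built in a single expression
labels = [(w, describe(w)) for w in sample_tokens]
for w, label in labels:
    print(w, label)
```

A comprehension is convenient when the results are needed later, for example to count how many tokens fall under each label.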

In [None]:
#Explore how words starting with "america" and "war" are used in presidential inaugural speeches over time.
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])  # condition: target word; sample: year taken from the file name
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'war']
    if w.lower().startswith(target))
cfd.plot()

In [None]:
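# Compare how often selected modal verbs occur in different genres of the Brown corpus
import nltk
from nltk import FreqDist
nltk.download('brown')
from nltk.corpus import brown
verbs=["should", "may", "can"]
genres=["news", "government", "romance"]
for g in genres:
    words=brown.words(categories=g)
    freq=FreqDist(w.lower() for w in words if w.lower() in verbs)
    print(g, dict(freq))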

In [None]:
from nltk import FreqDist
from nltk.book import text4  # the inaugural addresses; needs the corpora used by nltk.book
#finding words that characterize a text: relatively long and occurring frequently
fdist = FreqDist(text4)
sorted([w for w in set(text4) if len(w) > 5 and fdist[w] > 100])

In [None]:
# count how often a word occurs in a text
print(text4.count("democracy"))
# compute what percentage of the text is taken up by a specific word
print(100 * text4.count('democracy') / len(text4))
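The percentage calculation works the same way on any list of tokens, which makes it easy to reuse on your own data. The sketch below is an illustration with a hypothetical helper `percentage` and a made-up toy sentence; it needs no NLTK corpora.

```python
def percentage(word, tokens):
    """Share of tokens that are exactly `word`, as a percentage of all tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return 100 * tokens.count(word) / len(tokens)


# Made-up example text, split on whitespace
toy_tokens = "we the people hold that democracy is for the people".split()
print(percentage("people", toy_tokens))
print(percentage("democracy", toy_tokens))
```

Note that the count is case-sensitive, just like `text4.count`, so lower-casing the tokens first may be appropriate.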

In [None]:
#Location of a word in the text: how many words from the beginning does it appear?
#This positional information can be displayed using a dispersion plot.
#The plot requires NumPy and Matplotlib.
nltk.download('treebank')
from nltk.book import text4
text4.dispersion_plot(["citizens", "democracy", "freedom", "war", "America", "vote"])

In [None]:
#Words in context
text4.concordance("vote")

In [None]:
#What other words appear in a similar range of contexts? 
text4.similar("vote")

## Processing collected tweets
Next, we need to process the JSON file, extract the relevant fields and put them in a database for coding. The file is very large, so we won't process it live because it would take too long. Let's have a look at what it looks like and identify the relevant fields.

### Things to consider:

* Which fields are the relevant ones? 

* How do we make sure we have the full text of the tweet? 

* What should we do about retweets?

* How do we remove extra white spaces from the text? 

* What other cleaning do you think we'd need to do? 

* If we wanted to code some tweets, how do we draw a sub-sample of tweets for coding? 
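Two of the questions above, getting the full text and removing extra white space, can be sketched with the standard library alone. The field names `retweeted_status`, `full_text` and `id_str` are the ones used in the processing code of this section; the fallback to a `text` field, the helper names and the example record are assumptions made for illustration.

```python
import json
import re


def clean_whitespace(text):
    """Collapse line breaks, tabs and repeated spaces into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


def extract_text(tweet):
    """Return (retweeted_status_id or None, cleaned full text) for one tweet dict."""
    original = tweet.get("retweeted_status")
    # For a retweet, the untruncated text lives in the original tweet
    source = original if original is not None else tweet
    # Assumption: fall back to "text" when "full_text" is absent
    text = source.get("full_text") or source.get("text", "")
    rt_id = original["id_str"] if original is not None else None
    return rt_id, clean_whitespace(text)


# Made-up record in the one-JSON-object-per-line layout
line = json.dumps({
    "id_str": "2",
    "full_text": "RT @someone: Revoke...",
    "retweeted_status": {"id_str": "1", "full_text": "Revoke,\n Revoke,\tRevoke  "},
})
print(extract_text(json.loads(line)))
```

Keeping the retweeted status id alongside the text makes it possible to decide later whether retweets should be counted, collapsed or dropped.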

In [None]:
#import csv
#import json
#total_tweets=0
#retweets=0
## utf-8 encoding lets us write the text as str instead of encoded bytes
#with open('brexit_tweets_selected_fields_a.csv', 'w', encoding='utf-8', newline='') as outfile:
#    writer = csv.writer(outfile, delimiter=',', lineterminator='\n')
#    header = ['tweet_id', 'user_id', 'screen_name', 'followers_count', 'friends_count', 'statuses_count', 'tweet_date', 'retweeted_status_id', 'retweet_count', 'text']
#    writer.writerow(header)
#    with open("brexit_tweets.json", "rb") as json_file:
#        # one JSON object (tweet) per line
#        for line in json_file:
#            i=json.loads(line.decode('utf8'))
#            total_tweets=total_tweets+1
#            date=str(i["created_at"])
#            try:
#                # retweet: keep the id and the untruncated text of the original tweet
#                rt_id=i["retweeted_status"]["id_str"]
#                text=i["retweeted_status"]["full_text"]
#                retweets=retweets+1
#            except KeyError:
#                # not a retweet: use the tweet's own text
#                rt_id="none"
#                text=i["full_text"]
#            # replace line breaks and tabs with spaces so each tweet stays on one CSV row
#            text=text.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').strip()
#            writer.writerow([i["id"], i["user"]["id"], i["user"]["screen_name"], i["user"]["followers_count"], i["user"]["friends_count"], i["user"]["statuses_count"], date, rt_id, i["retweet_count"], text])
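For the last question in the list, drawing a sub-sample for coding, one possible approach is a seeded random sample, so that the same tweets can be drawn again later. This is a sketch, not the workshop's method: `draw_sample`, the seed value and the example rows are assumptions.

```python
import random


def draw_sample(rows, k, seed=42):
    """Return k rows chosen at random without replacement, reproducibly."""
    rng = random.Random(seed)   # a private generator keeps the draw repeatable
    k = min(k, len(rows))       # never ask for more rows than exist
    return rng.sample(rows, k)


# Made-up rows standing in for the lines of the CSV file
rows = [{"tweet_id": n, "text": f"tweet number {n}"} for n in range(100)]
coding_sample = draw_sample(rows, 10)
print([r["tweet_id"] for r in coding_sample])
```

If retweets should not be coded more than once, the rows can be de-duplicated on the text or on the retweeted status id before sampling.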
     