 
# <center> #DHBSI 2016: Computational Text Analysis </center>

## <center> Laura Nelson <br/> <em>Postdoctoral Fellow | Digital Humanities @ Berkeley | Berkeley Institute for Data Science </em> </center>

## <center> Teddy Roland <br/> <em> Coordinator, Digital Humanities @ Berkeley <br/> Lecturer, UC Berkeley </em> </center>

# <center> Summary </center>
## <center> Text Analysis Demystified </center>
### <center> It's Just Counting! <br/> </center>
![Counting](Text_Counting.jpg)

## <center> The Dark Side of DH: An Invitation </center>
![Dark Side](Dark_Side.jpg)

## <center> Text Analysis in Research </center>
![Interpretive Moments](Text_Analysis_In_Reearch.jpg)

## <center> Lessons </center>
### <center> Over 5 days and 7 lessons, our workshop showed how counting, sometimes creative counting, can amplify and augment close readings of text </center>
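
To make "it's just counting" concrete, here is a minimal sketch using only the Python standard library. The sentence is a made-up example, not one of the workshop texts, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical one-line "text", chosen only for illustration.
sentence = "the whale and the sea and the ship"

# Splitting on whitespace is the crudest possible tokenizer.
tokens = sentence.split()

# Counter tallies how often each token occurs.
counts = Counter(tokens)

print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```

Every lesson below builds on this tally, changing only what gets counted and how the counts are weighted.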

In [None]:
##Lesson 1: Introduction to Natural Language Processing

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
punctuations = list(string.punctuation)

#read the two text files from your hard drive, assign first mystery text to variable 'text1' and second mystery text to variable 'text2'
text1 = open('../01-Intro-to-NLP/text1.txt').read()
text2 = open('../01-Intro-to-NLP/text2.txt').read()

###word frequencies

#tokenize texts
text1_tokens = word_tokenize(text1)
text2_tokens = word_tokenize(text2)

#pre-process for word frequency
#lowercase
text1_tokens_lc = [word.lower() for word in text1_tokens]
text2_tokens_lc = [word.lower() for word in text2_tokens]

#remove stopwords (build the set once instead of reloading the list for every token)
stop_words = set(stopwords.words('english'))
text1_tokens_clean = [word for word in text1_tokens_lc if word not in stop_words]
text2_tokens_clean = [word for word in text2_tokens_lc if word not in stop_words]

#remove punctuation using the list of punctuation from the string package
text1_tokens_clean = [word for word in text1_tokens_clean if word not in punctuations]
text2_tokens_clean = [word for word in text2_tokens_clean if word not in punctuations]

#frequency distribution
text1_word_frequency = nltk.FreqDist(text1_tokens_clean)
text2_word_frequency = nltk.FreqDist(text2_tokens_clean)



print("Frequent Words for Text1")
print("________________________")
for word in text1_word_frequency.most_common(20):
    print(word[0])
print()
print("Frequent Words for Text2")
print("________________________")
for word in text2_word_frequency.most_common(20):
    print(word[0])
    
    
###Can you guess the novel from most frequent words?
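
The Lesson 1 cell needs the NLTK data packages and the workshop's text files. For readers without either, the same lowercase, filter, and count steps can be sketched with the standard library alone. The stopword set, the regex tokenizer, the `top_words` helper, and the sample sentence are all assumptions made for illustration; they are not NLTK's tokenizer or stopword list.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set standing in for NLTK's English list.
TOY_STOPWORDS = {"the", "a", "of", "and", "is", "it", "to", "in", "was"}

def top_words(text, n=3, stop_words=TOY_STOPWORDS):
    """Return the n most frequent non-stopword tokens in text."""
    # Lowercase, then keep runs of letters (a rough stand-in for word_tokenize);
    # punctuation never makes it into the token list.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept).most_common(n)

# Hypothetical sample text.
sample = "The whale was white. The whale is the sea, and the sea is the whale."
print(top_words(sample, n=2))
```

Swapping in `word_tokenize` and `stopwords.words('english')` would bring this sketch back in line with the cell above.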

In [None]:
##Lesson 2: Basics of Python
###Nothing to see here, folks


##Lesson 3: Operationalizing
import pandas
dialogue_df = pandas.read_csv('../03-Operationalizing/antigone_dialogue.csv', index_col=0)
dialogue_tokens = [character.split() for character in dialogue_df['DIALOGUE']]
dialogue_len = [len(tokens) for tokens in dialogue_tokens]
dialogue_df['WORDS_SPOKEN'] = dialogue_len
dialogue_df = dialogue_df.sort_values('WORDS_SPOKEN', ascending = False)
# Let's visualize!

# Tells Jupyter to produce images in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

# Makes images look good
plt.style.use('ggplot')
dialogue_df['WORDS_SPOKEN'].plot(kind='bar')

###Who is the main protagonist? Maybe not Antigone?
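
The operationalization in Lesson 3 treats "words spoken" as a proxy for a character's prominence. A minimal sketch of that measure without pandas follows; the character names and speeches are invented placeholders, not lines from `antigone_dialogue.csv`.

```python
# Hypothetical dialogue; the real data lives in antigone_dialogue.csv.
toy_dialogue = {
    "ALPHA": "I will speak at length about duty and law",
    "BETA": "I disagree",
    "GAMMA": "The gods decide such things",
}

# Operationalize "prominence" as the number of whitespace-separated words spoken.
words_spoken = {name: len(speech.split()) for name, speech in toy_dialogue.items()}
total = sum(words_spoken.values())

# Rank characters from most to fewest words spoken.
ranking = sorted(words_spoken.items(), key=lambda item: item[1], reverse=True)
for name, count in ranking:
    print(f"{name}: {count} words ({count / total:.0%} of dialogue)")
```

Reporting each character's share of all dialogue, rather than the raw count, makes plays of different lengths easier to compare.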

In [None]:
# Lesson 4: Discriminating Words

from sklearn.feature_extraction.text import TfidfVectorizer

df = pandas.read_csv("../04-Discriminating-Words/BDHSI2016_music_reviews.csv", sep = '\t')

tfidfvec = TfidfVectorizer()
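#create the dtm, but with cells weighted by the tf-idf score.
#get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names_out(), index = df.index)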

df_genre = df['genre'].to_frame()
merged_df = df_genre.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Rap"]
dtm_indie = merged_df[merged_df['genre_x']=="Alternative/Indie Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Jazz"]

#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

###What words are distinct to reviews of Rap albums, Indie albums, and Jazz albums?
##Notice the word weights for the Rap albums compared to the others. Are Rap reviews more distinctive than reviews of other genres?
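
The tf-idf weights in Lesson 4 are still counts, just reweighted. The sketch below computes them by hand on a made-up three-document corpus. It assumes scikit-learn's default settings as I understand them (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit length); the toy documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["beat", "flow", "beat", "album"],
    ["guitar", "album", "riff"],
    ["sax", "solo", "album"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight raw counts by smoothed idf, then scale the vector to unit length."""
    counts = Counter(doc)
    weights = {
        word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

# Print the highest-weighted word in each document.
for doc in docs:
    scores = tfidf(doc)
    best = max(scores, key=scores.get)
    print(best, round(scores[best], 3))
```

Because "album" appears in every document, its idf term is the minimum possible, so words confined to one document outrank it even at the same raw count.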

In [None]:
# Lesson 5: Sentiment Analysis using the Dictionary Method

pos_sent = open("../05-Dictionary-Method/positive_words.txt").read()
neg_sent = open("../05-Dictionary-Method/negative_words.txt").read()

#sets make the membership tests below much faster than lists
positive_words = set(pos_sent.split('\n'))
negative_words = set(neg_sent.split('\n'))

text1_pos = [word for word in text1_tokens_clean if word in positive_words]
text2_pos = [word for word in text2_tokens_clean if word in positive_words]

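#use the same cleaned (lowercased) tokens as for the positive words so the two counts are comparable
text1_neg = [word for word in text1_tokens_clean if word in negative_words]
text2_neg = [word for word in text2_tokens_clean if word in negative_words]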

print("Positive words in Melville")
print(len(text1_pos)/len(text1_tokens))
print()
print("Negative words in Melville")
print(len(text1_neg)/len(text1_tokens))
print()
print("Positive words in Austen")
print(len(text2_pos)/len(text2_tokens))
print()
print("Negative words in Austen")
print(len(text2_neg)/len(text2_tokens))

###Who is more positive, Melville or Austen?
##Melville has a similar percentage of positive and negative words (a whale is a whale, neither good nor bad)
##Austen is decidedly more positive than negative (it's the gentlemanly thing to do)
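
The dictionary method reduces to one reusable step: the share of tokens that appear in a word list. A minimal sketch follows, with invented mini-lexicons in place of `positive_words.txt` and `negative_words.txt`; the `sentiment_share` helper and the sample sentence are hypothetical.

```python
# Hypothetical mini-lexicons; the workshop uses positive_words.txt and negative_words.txt.
toy_positive = {"good", "happy", "love"}
toy_negative = {"bad", "sad", "hate"}

def sentiment_share(tokens, lexicon):
    """Fraction of tokens that appear in the lexicon (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in lexicon)
    return hits / len(tokens)

# Hypothetical sample sentence, tokenized on whitespace.
tokens = "I love this good happy book but hate the ending".split()
print(sentiment_share(tokens, toy_positive))  # 0.3
print(sentiment_share(tokens, toy_negative))  # 0.1
```

Using the same token list and the same denominator for both lexicons keeps the positive and negative shares directly comparable.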

In [None]:
#Lesson 6: Literary Distinction

from sklearn.naive_bayes import MultinomialNB
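
This summary cell only imports `MultinomialNB`; the rest of the lesson's code is not reproduced here. As a rough illustration of what a multinomial Naive Bayes classifier computes, the sketch below implements the counting by hand with add-one smoothing (which, as far as I know, matches `MultinomialNB`'s default `alpha=1.0`). The labels, training snippets, and `predict` helper are invented for illustration and are not the workshop's data or method.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled snippets; labels and texts are invented for illustration.
training = [
    ("group_a", "the sea the whale the storm"),
    ("group_a", "whale and sea and sky"),
    ("group_b", "love and marriage and money"),
    ("group_b", "money love letters"),
]

word_counts = defaultdict(Counter)  # per-class word tallies
class_counts = Counter()            # number of training texts per class
for label, text in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log posterior under add-one smoothing."""
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior: the share of training texts with this label.
        score = math.log(class_counts[label] / len(training))
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("the whale and the sea"))
print(predict("love and money"))
```

The classifier is again built from word tallies: each class is summarized by how often it uses each word, and a new text is assigned to the class whose tallies make it most probable.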




In [None]:
#Lesson 7: Topic Modeling