# Text analysis with Python

Gigantic amounts of text is created every day. With software, we can now make it data. 

YOU MUST DO THIS ON THE COMMAND LINE FIRST, BEFORE STARTING JUPYTER NOTEBOOK:

`pip install -U textblob`

Next you'll want to find that `unlcrime.csv` file and put it in the same place you have this notebook in. For most of you, that will be in your downloads folder. If it's aready there -- and it very well may be -- then you're good to go. 

After that, start Jupyter Notebook.

Our first step is always to import the libraries we need. In this case, we're importing textblob and csv, because we'll need it to read the data file.

In [1]:
import textblob, csv

Our first step to moving to meaningful data analysis is to get the csv file into a text file for us to use. First, we open the csv file, and we open a text file that we can write to. Then we loop through the csv, writing out the narratives for each incident in 2016 to that text file. Why only 2016? I tried doing this with all the years and it took forever on a reasonably decent machine. So this will give you a flavor of what can be done without killing your computer. 

In [7]:
unl = csv.reader(open("unlcrime.csv", "r"), dialect="excel")
file = open("unlnarratives.txt", "w") 

for row in unl:
    if row[11] == "2015":
        file.write(row[10]+"\n")
    
file.close()

Now, with a file, we can open it back up, read it and then turn it into something called a TextBlob, which the textblob library needs to work with. 

In [2]:
file = open("unlnarratives.txt", "r")
text = file.read()
blob = textblob.TextBlob(text)

Some of the most simple things we can do with text analysis is figure out what we have. For instance, we can get all nouns out of the incidents data like this. 

In [9]:
nouns = blob.noun_phrases

len(nouns)

7240

So there's 6,533 noun phrases in the data. But how many words?

In [10]:
words = blob.words

len(words)

40634

So in a year, UNLPD wrote 32,229 words about crimes, mostly in single sentence descriptions on what went on. 

### Your turn

Using [noun phrase and word count frequencies](https://textblob.readthedocs.io/en/dev/quickstart.html#get-word-and-noun-phrase-frequencies), what questions might you ask of this data? What would you want to know when it comes to words used by UNL police? How would you express that in code here? 

For the assignment, you must formulate a question that can be answered with noun phrase and word count frequencies. Then you must answer it with code. Turn in this notebook to the assignment. 

In [11]:
# How many times does drone appear? 

blob.word_counts['drone']

2

### Sentences and sentiment

Along with nouns and words, you can also get sentences. And with sentences, we can get to sentiment analysis. TextBlob has a built in sentiment classifier, and the ability to build your own. Let's test it out using the first 10 sentences of the UNLPD data. 

In [3]:
sentences = blob.sentences

In [4]:
tensentences = sentences[1:10]

In [5]:
for sentence in tensentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.05)
Sentiment(polarity=0.0, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.175, subjectivity=0.45)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.1, subjectivity=0.4)


In [6]:
for sentence in tensentences:
    print(sentence)

BrAC 0.222.
Malfunctioning steam pipe caused the Alarm in the building to go off.
Turned over to UNL maintenance.
A non-UNL affiliate was contacted on a traffic stop and subsequently cited and released for One headlight at night and Possess loaded shotgun on highway.
A non-UNL affiliate was cited and released for Fictitious Plates, Open Title, No Valid Registration, DUS, and No Proof of Insurance.
Vehicle was towed to Capitol Towing.
Officers responded to a general fire alarm at Theta Xi which was found to be caused by a pipe that had broken.
UNLPD Ofc.
assisted LPD with UNL video footage from a robbery which just occurred near East Campus.


### Your turn

Here's a textblob of the last 10 tweets from Donald Trump. Rate each for polarity -- -1 is perfectly negative and +1 is perfectly positive -- and on subjectivity, where 0 is objective and 1 is subjective. Then run textblob's sentiment analysis and compare yours to its. How did you do? What's the problem with all of this? 

In [9]:
tweets = textblob.TextBlob(
    "Big vote tomorrow in the House. Tax cuts are getting close! "
    "Why are Democrats fighting massive tax cuts for the middle class and business (jobs)? The reason: Obstruction and Delay! "
    "It is actually hard to believe how naive (or dumb) the Failing @nytimes is when it comes to foreign policy...weak and ineffective! "
    "...They should realize that these relationships are a good thing, not a bad thing. The U.S. is being respected again. Watch Trade! "
    "The failing @nytimes hates the fact that I have developed a great relationship with World leaders like Xi Jinping, President of China. "
    "Do you think the three UCLA Basketball Players will say thank you President Trump? They were headed for 10 years in jail! "
    "While in the Philippines I was forced to watch @CNN, which I have not done in months, and again realized how bad, and FAKE, it is. Loser! "
    ".@foxandfriends will be showing much of our successful trip to Asia, and the friendships & benefits that will endure for years to come! "
    "Our great country is respected again in Asia. You will see the fruits of our long but successful trip for many years to come! "
    "Just returned from Asia after 12 very successful days. Great to be home! "
)

In [7]:
today = textblob.TextBlob("China is sending an Envoy and Delegation to North Korea - A big move, we'll see what happens!")

today.sentiment

Sentiment(polarity=0.0, subjectivity=0.1)

In [10]:
sentences = tweets.sentences

for sentence in sentences:
    print(sentence, sentence.sentiment)

Big vote tomorrow in the House. Sentiment(polarity=0.0, subjectivity=0.1)
Tax cuts are getting close! Sentiment(polarity=0.0, subjectivity=0.0)
Why are Democrats fighting massive tax cuts for the middle class and business (jobs)? Sentiment(polarity=0.0, subjectivity=0.5)
The reason: Obstruction and Delay! Sentiment(polarity=0.0, subjectivity=0.0)
It is actually hard to believe how naive (or dumb) the Failing @nytimes is when it comes to foreign policy...weak and ineffective! Sentiment(polarity=-0.2807291666666667, subjectivity=0.5416666666666666)
...They should realize that these relationships are a good thing, not a bad thing. Sentiment(polarity=0.5249999999999999, subjectivity=0.6333333333333333)
The U.S. is being respected again. Sentiment(polarity=0.0, subjectivity=0.0)
Watch Trade! Sentiment(polarity=0.0, subjectivity=0.0)
The failing @nytimes hates the fact that I have developed a great relationship with World leaders like Xi Jinping, President of China. Sentiment(polarity=0.45, 

### Discussion

1. Knowing what you know about algorithms, what inputs do you think there are for sentiment analysis?  
2. What problems might there be with those inputs? [Hint](https://www.engadget.com/2017/10/25/googles-sentiment-analysis-api-is-just-as-biased-as-humans/).