# Loading JSON Data

A variable is a **box** that can contain almost anything. Below we take some bigger steps: instead of strings and integers, we examine **a whole corpus of tweets**. Don't worry if the code seems difficult; it is hard the first time you see it. The point of this sudden acceleration is to demonstrate the power of coding: with relatively few lines of code you can accomplish a lot.

As an example we use the tweets of Donald Trump, the American President at the time of writing. We obtained them via the [Trump Twitter Archive](http://www.trumptwitterarchive.com/archive).

The database is a [JSON](https://en.wikipedia.org/wiki/JSON) file in which each item is a tweet. The cell below shows the first tweet of the collection. JSON notation may seem difficult to read, but there are various tools to help you. For example, go to this [JSON viewer](http://jsonviewer.stack.hu/) and copy and paste the text from the cell below into it.

``{
    "source":"Twitter for iPhone",
    "text":"The Tax Cut Bill is coming along very well, great support. With just a few changes, some mathematical, the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well!",
    "created_at":"Mon Nov 27 14:24:36 +0000 2017",
    "retweet_count":15663,
    "favorite_count":79868,
    "is_retweet":false,
    "id_str":"935152378747195392"}``

Inspect the JSON file. What information is in there, what is missing? What type of questions could one answer using these data? 
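
To see what such a record looks like from inside Python, the sketch below parses a shortened copy of the tweet above with the standard `json` module. The `text` field is cut short for readability, and the names `raw` and `tweet` are chosen here for illustration only.

```python
import json

# A shortened copy of the tweet shown above (the text field is truncated here).
raw = '''{
    "source": "Twitter for iPhone",
    "text": "The Tax Cut Bill is coming along very well, great support.",
    "created_at": "Mon Nov 27 14:24:36 +0000 2017",
    "retweet_count": 15663,
    "favorite_count": 79868,
    "is_retweet": false,
    "id_str": "935152378747195392"
}'''

# json.loads turns the JSON text into a Python dictionary
tweet = json.loads(raw)

# list every field with the Python type of its value
for key, value in tweet.items():
    print(f"{key:15} {type(value).__name__:5} {value}")
```

Note that the counts arrive as integers and `is_retweet` as a boolean, while `id_str` and `created_at` are plain strings.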

**Just FYI**, each tweet actually carries more information than shown here. Inspect the "example.json" file in the previously mentioned [viewer](http://jsonviewer.stack.hu/).

Okay, let's have a closer look at the corpus, which includes all tweets after the inauguration. Pandas is a very useful library for loading and interrogating data. Simply run the code below (and relax, you are not supposed to understand everything, except maybe the line that reads the JSON file).

In [None]:
# import the pandas library
import pandas as pd
# read the JSON corpus, or: put all tweets in a box called tweets
tweets = pd.read_json('data/trump2.json')
# or load the same file directly from GitHub (pd.read_json also accepts a URL)
# url = "https://raw.githubusercontent.com/kasparvonbeelen/Coding-the-Humanities/master/lecture1/data/trump2.json"
# tweets = pd.read_json(url)
# ignore for now, this simply uses the moment of posting as an index
tweets.set_index('created_at',inplace=True)
# keep only the tweets posted by Trump himself (i.e. exclude retweets)
tweets = tweets[tweets.is_retweet==False]
# show the first five rows
tweets.head()

With these few lines, you managed to load the whole corpus of Trump tweets.
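
The same three steps (load, index, filter) can be tried on a small scale. The sketch below builds a table from three made-up records, invented for illustration and not taken from the corpus, so that each step is easy to follow. The name `toy` and the record contents are assumptions of this example.

```python
import pandas as pd

# Toy records invented for illustration; they only mimic the fields of the corpus.
records = [
    {"text": "toy tweet one", "created_at": "2017-11-27 14:24:36",
     "retweet_count": 10, "is_retweet": False},
    {"text": "toy tweet two", "created_at": "2017-11-28 09:00:00",
     "retweet_count": 25, "is_retweet": True},
    {"text": "toy tweet three", "created_at": "2017-11-29 18:30:00",
     "retweet_count": 5, "is_retweet": False},
]

toy = pd.DataFrame(records)
# parse the timestamp strings and use them as the index
toy["created_at"] = pd.to_datetime(toy["created_at"])
toy = toy.set_index("created_at")
# keep only the rows that are not retweets
toy = toy[~toy["is_retweet"]]
print(toy)
```

The second record is dropped because it is marked as a retweet, which leaves two rows.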

In [None]:
# Exercise 1: print the first 10 rows
# Exercise 2: What is the type of the tweets variable?

In [None]:
# Exercise 3: You can count the number of tweets 
# by wrapping the "len()" function around the "tweets" variable. Try it!
len(tweets)

## Exploring Data

Pandas allows you to inspect the data with the help of some descriptive statistics and plots. Run the cell below, otherwise the plots won't appear in the Notebook.

In [None]:
# Run this cell to plot all figures in the Notebook
%matplotlib inline

Does the table below give you an overview of the whole dataset?

In [None]:
tweets.describe()

In [None]:
# Get the summary statistics for the retweet_count column
tweets.retweet_count.describe()
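
As a rough guide to reading this output: for a numeric column, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. The sketch below runs it on five made-up numbers (not taken from the corpus) to show how a single very large value affects the mean but not the median.

```python
import pandas as pd

# Made-up retweet counts; one very large value mimics a viral tweet.
counts = pd.Series([10, 20, 30, 40, 1000])

summary = counts.describe()
print(summary)

# the mean is pulled up by the outlier, the median (the "50%" row) is not
print(summary["mean"], summary["50%"])
```

When the mean and the median differ this much, a histogram is usually a better summary than the mean alone.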

An easy way to study the popularity of Trump's tweets is to plot the number of retweets and favorites over time.

In [None]:
# Plot retweets and favorites over time
tweets['retweet_count'].plot(legend=True)
tweets['favorite_count'].plot(legend=True)

In [None]:
from datetime import datetime
to_month = lambda x: datetime(x.year,x.month,1)
to_day = lambda x: datetime(x.year,x.month,x.day)

In [None]:
# plot the mean per month; swap to_month for to_day to plot the mean per day
tweets['retweet_count'].groupby(to_month).mean().plot()

**Question**: Changing the unit of analysis (month or day) leads to different figures. Which, do you think, is most interesting?
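
To make the grouping step concrete, here is a minimal sketch on made-up numbers (four invented counts on four dates). It groups by month in the same way as the cell above and compares the result with pandas' `resample`, an alternative that may be worth knowing. The series `counts` and its values are assumptions of this example.

```python
from datetime import datetime
import pandas as pd

# Made-up counts on four dates spread over two months.
counts = pd.Series(
    [10, 30, 100, 200],
    index=pd.to_datetime(["2017-01-05", "2017-01-20", "2017-02-03", "2017-02-10"]),
)

def to_month(x):
    # map a timestamp to the first day of its month
    return datetime(x.year, x.month, 1)

# groupby calls to_month on every index value and averages each group
by_month = counts.groupby(to_month).mean()

# resample("MS") bins by month start; unlike groupby, it also emits
# months without any data (filled with NaN)
by_month_resampled = counts.resample("MS").mean()

print(by_month)
print(by_month_resampled)
```

On this toy data both approaches give the same two monthly means; they only differ when a month has no rows at all.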

A **histogram** gives an indication of the distribution of the values.

In [None]:
tweets['retweet_count'].plot(kind='hist',bins=100)

For closer inspection, you can sort the table by a certain column. 

In [None]:
# the ten most retweeted tweets
#tweets.sort_values('retweet_count',ascending=False)[:10]
# the ten least retweeted tweets
tweets.sort_values('retweet_count',ascending=False)[-10:]

#### Exercise
Make plots and sort the data, but this time for the **"favorite_count"** column.

In [None]:
# Add your code here

## VADER Sentiment Analyzer
[From GitHub](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

VADER uses a lexicon (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
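
The lexicon idea can be sketched in a few lines of plain Python. This is a deliberately naive toy and not VADER's actual algorithm: the lexicon `TOY_LEXICON` and the function `toy_sentiment` are hypothetical, and apart from the two values mentioned above the scores are invented for this example.

```python
# Hypothetical toy lexicon; the real VADER lexicon is far larger
# and uses different values.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

def toy_sentiment(text):
    """Sum the lexicon values of all words in text (unknown words count as 0)."""
    words = text.lower().split()
    # strip common punctuation so that "good." still matches "good"
    cleaned = [word.strip(".,!?") for word in words]
    return sum(TOY_LEXICON.get(word, 0.0) for word in cleaned)

print(toy_sentiment("This is good."))
print(toy_sentiment("Not good."))
print(toy_sentiment("Terrible, bad idea"))
```

The toy gives "Not good." the same score as "This is good.", because it only looks up words one by one. Handling cases like negation is what the rule-based part of a tool like VADER is for.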

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?

In [None]:
text = "Not interesting."
vs = analyzer.polarity_scores(text)['compound']
print("{:_<65} {}".format(text, str(vs)))

Now we can easily calculate the sentiment of Trump's tweets. 

In [None]:
compound_sentiment = lambda x: analyzer.polarity_scores(x)['compound']
tweets['compound_sentiment'] = tweets['text'].apply(compound_sentiment)
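
The pattern used here, calling a function on every value of a column with `apply` and storing the results in a new column, can be tried without VADER. In the sketch below, `stand_in_score` is a hypothetical stand-in for the real scoring function and the three texts are made up.

```python
import pandas as pd

# Made-up texts for illustration.
toy = pd.DataFrame({"text": ["good news", "bad news", "news"]})

def stand_in_score(text):
    # hypothetical stand-in for analyzer.polarity_scores(text)['compound']
    if "good" in text:
        return 1.0
    if "bad" in text:
        return -1.0
    return 0.0

# apply calls stand_in_score once per row and returns one value per row
toy["score"] = toy["text"].apply(stand_in_score)
print(toy)
```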

### Exercises

In [None]:
# print the ten most negative tweets (the ten lowest compound scores)
s = tweets.sort_values('compound_sentiment',ascending=False)[-10:]
for i in s.text:
    print(i)

In [None]:
# make a timeline and histogram for the compound_sentiment column

## Indexing and slicing

In [None]:
# fetch a row at position 0
print(tweets.iloc[0])

In [None]:
# Exercise: get the last ten rows

If we want to find the most popular tweet, we can sort the rows by ``retweet_count`` and take the first row.

In [None]:
sorted_by_retweet = tweets.sort_values('retweet_count',ascending=False)
print(sorted_by_retweet.iloc[0])

You can read the text of a tweet using either of the following expressions:

In [None]:
print(sorted_by_retweet.iloc[0]['text'])

In [None]:
print(sorted_by_retweet.iloc[0].text)

Using the slicing technique we can retrieve the ten most popular tweets.

In [None]:
sorted_by_retweet.iloc[0:10]['text']
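
Position-based indexing is easier to follow on a table small enough to check by hand. The sketch below uses four made-up rows; the names `toy` and `ranked` are assumptions of this example.

```python
import pandas as pd

# Four made-up rows for illustration.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "retweet_count": [5, 50, 20, 1],
})

# highest retweet_count first: b (50), c (20), a (5), d (1)
ranked = toy.sort_values("retweet_count", ascending=False)

first_row = ranked.iloc[0]             # a single row, returned as a Series
top_two = ranked.iloc[0:2]["text"]     # positions 0 and 1; the end position is excluded
bottom_two = ranked.iloc[-2:]["text"]  # negative positions count from the end

print(first_row["text"])
print(top_two.tolist())
print(bottom_two.tolist())
```

`iloc` always counts positions in the current row order, which is why sorting first and slicing second gives the most and least popular rows.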

To read the full text, run the code below.

In [None]:
for r in sorted_by_retweet.iloc[0:10]['text']:
    print(r)
    print()

### Exercise
Print the ten most positive and ten most negative tweets

In [None]:
# Add your code here

Can you do the same, but instead look at the ``favorite_count`` column? Are the results very different?

In [None]:
# Add your code here

### You're done for today!
If you have some energy left, play around a bit with the code below.

The code below allows you to search for specific words in the Twitter corpus.

In [None]:
# True if the word w occurs anywhere in the lowercased text
contains_word = lambda x,w: w in x.lower()
tweets['contains_obama'] = tweets['text'].apply(contains_word,w='obama')
# keep only the rows where the word was found
about_obama = tweets[tweets.contains_obama]
len(about_obama)
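
pandas also has a built-in string method for this kind of search, `Series.str.contains`. A minimal sketch on three made-up texts (not taken from the corpus):

```python
import pandas as pd

# Three made-up texts for illustration.
toy = pd.DataFrame({
    "text": ["Obama said hello", "Nothing to see here", "Thanks OBAMA!"],
})

# case=False ignores upper/lower case; by default the pattern is
# interpreted as a regular expression
mask = toy["text"].str.contains("obama", case=False)
matches = toy[mask]

print(len(matches))
print(matches)
```

The result of `str.contains` is a column of True/False values, so it can be used directly to select rows, including rows where the word is the very first thing in the text.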

How does Trump use uppercase?

In [None]:
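def count_upper(text):
    """Return the share of characters in text that are uppercase."""
    if not text:
        return 0.0  # avoid dividing by zero on an empty tweet
    uppers = []
    for char in text:
        if char.isupper():
            uppers.append(char)
    return len(uppers)/len(text)

# retweets were already removed above; repeated so this cell also works on its own
tweets = tweets[tweets.is_retweet==False]
tweets['uppers'] = tweets.text.apply(count_upper)
# the ten tweets with the largest share of uppercase characters
tweets.sort_values('uppers',ascending=False)[:10]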