# Data analysis of Saskatoon tweets

First, let's import pandas, along with `json` for parsing the tweets and matplotlib for plotting.

In [11]:
import json
import pandas as pd
import matplotlib.pyplot as plt

# Enable inline plotting
%matplotlib inline

## Defining a dataframe extraction function and cleaning up the dataset

This function takes an input file containing one JSON-encoded tweet per line, in our case `stoonTweets.json`, and fills a pandas dataframe with the fields we need.

In [6]:
def pop_tweets(inputFile):
    """Read one JSON-encoded tweet per line and return them as a dataframe."""

    # Project proposal outlines these columns:
    # text, author, timestamp, hashtags, retweet count, location (for geotagged tweets), and source
    # (hashtags are not extracted here)

    # Declare a new data frame with pandas, with some specific column names
    tweets = pd.DataFrame(columns=[
        'userHandle', 'text', 'timestamp', 'location', 'retweet count', 'source'
    ])

    # Open the text file that contains the tweets we collected;
    # the with-block closes the file again when we are done
    with open(inputFile, "r") as tweets_file:

        # Read the text file line by line
        for line in tweets_file:

            # Skip blank lines, which json.loads cannot parse
            if not line.strip():
                continue

            # Load the JSON information
            tweet = json.loads(line)

            # If the tweet isn't empty, add it to the data frame
            if 'text' in tweet:
                # 'place' is null for tweets that are not geotagged
                place = tweet.get('place') or {}
                tweets.loc[len(tweets)] = [
                    tweet['user']['screen_name'], tweet['text'],
                    tweet['created_at'], place.get('full_name'),
                    tweet['retweet_count'], tweet['source']
                ]

    return tweets
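
Appending one row at a time with `tweets.loc[len(tweets)]` can become slow as the file grows. A minimal sketch of an alternative is shown below: it collects plain dictionaries and builds the dataframe once at the end. The helper name `tweets_to_frame`, the `COLUMNS` constant, and the sample records are made up for illustration and are not part of the original notebook.

```python
import json
import pandas as pd

COLUMNS = ['userHandle', 'text', 'timestamp', 'location', 'retweet count', 'source']

def tweets_to_frame(lines):
    """Hypothetical variant of pop_tweets: parse an iterable of JSON lines."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        if 'text' not in tweet:
            continue  # skip records that carry no tweet text
        place = tweet.get('place') or {}  # place is null when not geotagged
        rows.append({
            'userHandle': tweet['user']['screen_name'],
            'text': tweet['text'],
            'timestamp': tweet['created_at'],
            'location': place.get('full_name'),
            'retweet count': tweet['retweet_count'],
            'source': tweet['source'],
        })
    # Build the dataframe in one step instead of growing it row by row
    return pd.DataFrame(rows, columns=COLUMNS)

# Made-up records in the same shape as the collected tweets
sample_lines = [
    json.dumps({'user': {'screen_name': 'example_user'}, 'text': 'hello',
                'created_at': 'Mon Dec 04 21:13:31 +0000 2017',
                'place': {'full_name': 'Saskatoon, Saskatchewan'},
                'retweet_count': 0, 'source': 'Example client'}),
    '',                              # blank line, skipped
    json.dumps({'delete': {}}),      # no 'text' field, skipped
    json.dumps({'user': {'screen_name': 'other_user'}, 'text': 'no geotag',
                'created_at': 'Mon Dec 04 21:11:45 +0000 2017',
                'place': None,
                'retweet_count': 0, 'source': 'Example client'}),
]
sample = tweets_to_frame(sample_lines)
print(sample)
```

With a real file, the same helper could be called as `tweets_to_frame(open('stoonTweets.json'))`, since a file object iterates line by line.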

Now let's look at the contents of our dataframe.

In [64]:
pd.set_option('display.max_colwidth', 200)  
pd.set_option('display.max_rows', 3000)
yxe_tweets

Unnamed: 0,userHandle,text,timestamp,location,retweet count,source
0,TheCandyShow,"Gord Downie Was Celebrated For Championing Indigenous Rights. Now That He's Gone, Do People Still Care?… https://t.co/eITxoT6skw",Mon Dec 04 21:13:31 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
1,LuxquisiteStyle,Sunday Dec 10th 12-4pm! Our last pop-up before Christmas! 🎄 Check out the link for details. https://t.co/07HHHTBGsa https://t.co/gSvHJadsxT,Mon Dec 04 21:11:45 +0000 2017,Luxquisite Clothing,0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
2,notsogoodal,Met a person who would not stop hitting me in the arm with the back of her hand during our entire conversation. Why… https://t.co/1woJhOOkwW,Mon Dec 04 21:08:20 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
3,snowded,@arifbobat @Mark_Nilsen_ @thisagileguy @tobiasmayer @alshalloway @RonJeffries Human systems tend to order - it may… https://t.co/6JmX7aH2kr,Mon Dec 04 21:08:06 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://tapbots.com/software/tweetbot/mac"" rel=""nofollow"">Tweetbot for Mac</a>"
4,Skaboomatude,Drinking an Angus Stout by @9milelegacy @ The Rook &amp; Raven — https://t.co/6c9gWxexPj,Mon Dec 04 21:08:00 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""https://untappd.com"" rel=""nofollow"">Untappd</a>"
5,TheCandyShow,Look at what @aircanada hooked me up with for my birthday!! #skyqueen #birthdaygirl @ Sheraton… https://t.co/U92W59yRGM,Mon Dec 04 21:04:57 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://instagram.com"" rel=""nofollow"">Instagram</a>"
6,Stoon_Slar,"@BarristerSecret Kinda like the quivalent of, “The dog wrote my homework.”",Mon Dec 04 21:01:11 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
7,WASHDUDE,True. Just sayin if you get stuck. It’s hard to dig out with pot cap limit there https://t.co/noDY19TkD4,Mon Dec 04 21:00:01 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
8,snowded,@Mark_Nilsen_ @arifbobat @thisagileguy @tobiasmayer @alshalloway @RonJeffries Yep - made it more explicit with the… https://t.co/iCk4fB1uUu,Mon Dec 04 20:53:11 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://tapbots.com/software/tweetbot/mac"" rel=""nofollow"">Tweetbot for Mac</a>"
9,snowded,@alshalloway @Mark_Nilsen_ @arifbobat @thisagileguy @tobiasmayer @RonJeffries And your flex site is good old cybern… https://t.co/PLJERgKk3B,Mon Dec 04 20:50:31 +0000 2017,"Saskatoon, Saskatchewan",0,"<a href=""http://tapbots.com/software/tweetbot/mac"" rel=""nofollow"">Tweetbot for Mac</a>"


Let's clean up the dataframe. We want to strip the `<a href=...>` tags from the source column, leaving only the client name, and drop some extraneous information from the timestamp column. Luckily, this is easy to do using pandas:

In [79]:
# Populate the pandas dataframe with our JSON file
yxe_tweets = pop_tweets('stoonTweets.json')
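
For the timestamp column, one possible clean-up is sketched below. It assumes the "extraneous information" is the weekday name and the `+0000` offset, and that a compact `YYYY-MM-DD HH:MM:SS` string is the desired result. Both are assumptions, and the small `sample` frame is only for illustration. The two timestamps are copied from the dataframe preview above.

```python
import pandas as pd

# Two timestamps copied from the dataframe preview
sample = pd.DataFrame({'timestamp': [
    'Mon Dec 04 21:13:31 +0000 2017',
    'Mon Dec 04 20:50:31 +0000 2017',
]})

# Parse the Twitter-style strings; %z consumes the +0000 UTC offset
parsed = pd.to_datetime(sample['timestamp'], format='%a %b %d %H:%M:%S %z %Y')

# Assumption: keep only date and time, formatted as a plain string
sample['timestamp'] = parsed.dt.strftime('%Y-%m-%d %H:%M:%S')
print(sample)
```

Keeping the parsed datetime values instead of formatting them back to strings may be preferable if the column is later used for sorting or resampling.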

In [78]:
# Each source value looks like
#   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
# A pattern anchored with ^...$ matches the whole string and blanks the column,
# so remove only the tags themselves and keep the text between them.
yxe_tweets['source'] = yxe_tweets['source'].str.replace(r'<[^>]+>', '', regex=True)
yxe_tweets['source']