<h1>Simplify the Dataset by Converting JSON to CSV and Extracting a Subset of Parameters</h1>
<p>In this example, we reduce the dataset from ~70 MB to ~5 MB and save two files: one containing basic tweet parameters and the other containing the tweet text. This is probably unnecessary for such a small dataset, but it becomes useful when working with much larger datasets consisting of millions of tweets.</p>
<p>We retain the following parameters in the CSV files; the list is by no means comprehensive:</p>
<ul>
<li>tweet_id</li>
<li>tweet_created_at</li>
<li>language</li>
<li>user_screen_name</li>
<li>user_created_at</li>
<li>user_id</li>
<li>followers_count</li>
<li>friends_count</li>
<li>time_zone</li>
<li>utc_offset</li>
<li>retweeted_status</li>
<li>retweet_id</li>
<li>retweet_user_screen_name</li>
<li>retweet_user_id</li>
</ul>
<p>The dataset was created by querying the Twitter Search API for the hashtag 'nerd'. Tweets were collected every 15 minutes and saved to a file. After two weeks, the files were processed to remove duplicate tweets and combined into a single file. Duplicate tweets are an artifact of requesting the maximum number of tweets in each 15-minute epoch, so consecutive requests can return some of the same tweets. Twitter limits the Search API to 100 tweets per 15-minute epoch; the documentation states 150, but we have observed 100.</p>
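<p>The deduplication step described above is not shown in this notebook. A minimal sketch of one way it could be done follows, assuming each collection window has already been loaded as a list of tweet dictionaries. The helper name <code>merge_unique_tweets</code> and the sample data are hypothetical and are not part of the original pipeline.</p>

```python
import json


def merge_unique_tweets(batches):
    """Combine lists of tweet dicts, keeping the first copy of each tweet id.

    Tweets without an 'id' key are skipped (an assumption of this sketch).
    """
    seen_ids = set()
    unique = []
    for batch in batches:
        for tweet in batch:
            tweet_id = tweet.get("id")
            if tweet_id is None or tweet_id in seen_ids:
                continue
            seen_ids.add(tweet_id)
            unique.append(tweet)
    return unique


# Two overlapping collection windows (made-up data for illustration)
batch_a = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
batch_b = [{"id": 2, "text": "second"}, {"id": 3, "text": "third"}]

combined = merge_unique_tweets([batch_a, batch_b])
print(json.dumps([t["id"] for t in combined]))
```

<p>Tracking ids in a set keeps each lookup cheap, and appending in arrival order preserves the original collection order in the combined list.</p>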
<h2>Import Packages</h2>
<p>As always, first we import the required Python packages.</p>

In [14]:
# Load packages
import os
import csv
import sys
import json
import datetime
from pprint import pprint

<h2>Open JSON, Parse Data, Save as CSV</h2>
<p>The comments in the code below describe the purpose of each section of code.</p>

In [15]:
# Print start time at start and end time at end
print("Start:" + str(datetime.datetime.now()))

# Set the output directory and hashtag; create the directory if it does not exist
output_dir = "csv/"
hashtag = "nerd"
os.makedirs(output_dir, exist_ok=True)

# Open main twitter data CSV file and write header row
main_output_file = output_dir + hashtag + "_main.csv"
f_main = open(main_output_file, 'w', newline='')
mainrowwriter = csv.writer(f_main, delimiter=',')
main_outputstring = ['tweet_id','tweet_created_at','language','user_screen_name','user_created_at','user_id','followers_count','friends_count','time_zone','utc_offset','retweeted_status','retweet_id','retweet_user_screen_name','retweet_user_id']
mainrowwriter.writerow(main_outputstring)

# Open twitter text data CSV file and write header row
text_output_file = output_dir + hashtag + "_text.csv"
# Write as UTF-8 so emoji and other non-ASCII characters are kept rather than silently dropped
f_text = open(text_output_file, 'w', encoding='utf-8', newline='')
textrowwriter = csv.writer(f_text, delimiter=',')
text_outputstring = ['tweet_id','text']
textrowwriter.writerow(text_outputstring)

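# Define counters and the input file location
inc = 0        # total number of tweets written
val = 0        # progress value printed every 10,000 tweets
val_inc = 0    # tweets processed since the last progress message
input_dir = 'tweet_data/'   # named input_dir to avoid shadowing the built-in dir()
filename = 'nerd.json'

with open(input_dir + filename, 'r') as f:
    print("Working on file:" + filename)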
    data = json.load(f)
    for tweet in data:
        if 'user' in tweet:
            
            # Set standard variables equal to tweet data
            tweet_id = tweet['id']
            tweet_created_at = tweet['created_at']
            text = tweet['text']
            language = tweet['lang']
            user_screen_name = tweet['user']['screen_name']
            user_created_at = tweet['user']['created_at']
            user_id = tweet['user']['id']
            followers_count = tweet['user']['followers_count']
            friends_count = tweet['user']['friends_count']
            utc_offset = tweet['user']['utc_offset']
            time_zone = tweet['user']['time_zone']
            
            # Check if a retweet else original tweet
            if 'retweeted_status' in tweet:
                retweeted_status = 1
                retweet_id = tweet['retweeted_status']['id']
                retweet_user_screen_name = tweet['retweeted_status']['user']['screen_name']
                retweet_user_id = tweet['retweeted_status']['user']['id']
            else:
                retweeted_status = 0
                retweet_id = "None"
                retweet_user_screen_name = "None"
                retweet_user_id = "None"
            
            # Write to main output file
            main_outputstring = [
                str(tweet_id), tweet_created_at, language, user_screen_name,
                user_created_at, str(user_id), str(followers_count), str(friends_count),
                time_zone, utc_offset, str(retweeted_status), str(retweet_id),
                retweet_user_screen_name, str(retweet_user_id),
            ]
            mainrowwriter.writerow(main_outputstring)
            
            # Write to text output file
            text_outputstring = [str(tweet_id), text]
            textrowwriter.writerow(text_outputstring)
            
            # Increment counters and print progress every 10,000 tweets, mostly for very large files
            inc += 1
            val_inc += 1
            if val_inc >= 10000:
                val += 10000
                print(str(val))
                val_inc = 0

# Close the output files (the input file was already closed by the with block)
f_main.close()
f_text.close()
print("End:" + str(datetime.datetime.now()))

Start:2018-12-08 06:56:35.276824
Working on file:nerd.json
10000
End:2018-12-08 06:56:37.554748
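
<p>The loop above indexes keys such as <code>tweet['lang']</code> directly, which raises a <code>KeyError</code> if a tweet lacks that field. The sketch below shows one possible way to make the extraction tolerant of missing fields by using <code>dict.get</code>. The helper name <code>flatten_tweet</code> and the sample tweet are hypothetical; the column order follows the header row written above.</p>

```python
def flatten_tweet(tweet):
    """Return one CSV row (a list of 14 values) for a tweet dictionary.

    Missing fields come back as None, which csv.writer writes as an empty string.
    """
    user = tweet.get("user") or {}
    retweet = tweet.get("retweeted_status")
    if retweet:
        rt_user = retweet.get("user") or {}
        retweet_fields = [1, retweet.get("id"),
                          rt_user.get("screen_name"), rt_user.get("id")]
    else:
        # Same placeholder values as the original loop for non-retweets
        retweet_fields = [0, "None", "None", "None"]
    return [
        tweet.get("id"), tweet.get("created_at"), tweet.get("lang"),
        user.get("screen_name"), user.get("created_at"), user.get("id"),
        user.get("followers_count"), user.get("friends_count"),
        user.get("time_zone"), user.get("utc_offset"),
    ] + retweet_fields


# Made-up tweet with no 'lang' field and no retweet information
sample = {
    "id": 101,
    "created_at": "Sat Dec 08 06:00:00 +0000 2018",
    "text": "hello #nerd",
    "user": {"screen_name": "example_user", "id": 7},
}
row = flatten_tweet(sample)
print(len(row), row[2], row[10])
```

<p>With a helper like this, the body of the loop could reduce to a single <code>writerow</code> call per output file, at the cost of writing empty cells instead of stopping on malformed tweets.</p>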


In [16]:
print("Total number of tweets:" + str(inc-1))

Total number of tweets:15368
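
<p>As a cross-check on the count above, the rows in an output file can be counted by reading it back. The sketch below uses <code>csv.reader</code> rather than counting lines, because tweet text may contain embedded newlines that span several physical lines. The helper name <code>count_csv_rows</code> and the temporary demo file are hypothetical; when applying it to the real output, pass the same encoding that was used to write the file.</p>

```python
import csv
import os
import tempfile


def count_csv_rows(path, encoding=None):
    """Count data rows in a CSV file, excluding the header row."""
    with open(path, "r", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


# Demo on a small temporary file whose second row contains a newline
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_text.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["tweet_id", "text"])
        writer.writerow(["1", "single line"])
        writer.writerow(["2", "two\nlines"])
    demo_count = count_csv_rows(demo_path)

print(demo_count)
```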
