# Project: Wrangle and Analyze Data (WeRateDogs)


## Table of Contents
<ul>
<li><a href="#intro">1. Introduction</a></li>
<li><a href="#gathering">2. Gathering</a></li>
<li><a href="#assessing">3. Assessing</a></li>
<li><a href="#clean">4. Cleaning</a></li>
<li><a href="#visualization">5. Analysis and Visualization</a></li>

</ul>

<a id='intro'></a>
## 1. Introduction

In this project, I wrangled **WeRateDogs** Twitter data to produce interesting and trustworthy analyses and visualizations. Because the Twitter archive contains only very basic tweet information, I gathered additional data through the Twitter API and combined it with the WeRateDogs archive. The combined data was then assessed and cleaned before analysis and visualization.

### Data 
#### WeRateDogs Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. It does contain each tweet's text, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I kept only the tweets with ratings (there are 2356).
![image.png](https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png)

#### Additional Data via the Twitter API

Retweet count and favorite count are important, but the archive omits them. I therefore gathered this information through Twitter's API for every tweet ID in the enhanced Twitter archive file.

#### Twitter Image Predictions File
This file contains the dog breed classification results from a neural network model for every image in the WeRateDogs Twitter archive. It is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
![image.png](https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png)

<a id='gathering'></a>
## 2. Gathering

In this part I gathered the three datasets for this project.
1. The **WeRateDogs Twitter archive** is provided as the `twitter-archive-enhanced.csv` file.
2. The **Twitter image predictions file**, `image-predictions.tsv`, is hosted on Udacity's servers and is downloaded programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
3. **Twitter API data**: each tweet's retweet count and favorite ("like") count at minimum, plus any additional fields of interest. Using the tweet IDs in the WeRateDogs Twitter archive, I **queried the Twitter API** for each tweet's JSON data with Python's Tweepy library and stored each tweet's entire set of JSON data in a file called `tweet_json.txt`, one tweet per line. I then read this file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.


### 2.1 Gather & Check WeRateDogs Twitter Archive file


In [None]:
#!pip install tweepy

In [None]:
#Import required libraries
import requests
import pandas as pd
import numpy as np
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Read file and display first few lines
df=pd.read_csv("twitter-archive-enhanced.csv")
df.head()
#df.shape

In [None]:
#Check number of rows and null values, types etc.
df.info()

### 2.2 Download & Check Twitter Image Predictions File

In [None]:
#Download the image predictions file and save it locally
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  #stop here if the download failed
with open('image-predictions.tsv', 'wb') as file:
    file.write(response.content)

image_pred=pd.read_csv("image-predictions.tsv",sep='\t')
image_pred.head()


In [None]:
#Check number of rows and null values, types etc.
image_pred.info()

### 2.3 Crawl Twitter data 

In [None]:

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''


auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)


In [None]:
#Check out failed queries.
for key in fails_dict.keys():
    print(key, fails_dict[key])
len(fails_dict)

In [None]:
#Retry crawling for the two tweet IDs that failed with connection errors.
#The other errors indicate that the tweet no longer exists or that I have no permission to view it.

failure_ids=[758740312047005698,676957860086095872]
retry_fails_dict={}
with open('tweet_json.txt', 'a') as outfile:
    for tweet_id in failure_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            retry_fails_dict[tweet_id] = e
            pass
#Show only the failures from this retry pass
print(retry_fails_dict)
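The manual retry above could also be folded into a small helper. The following is only a sketch under stated assumptions: `fetch` is a hypothetical callable standing in for a call such as `api.get_status`, and `TransientError` is a made-up exception class representing connection-type failures. Neither name comes from Tweepy.

```python
import time

class TransientError(Exception):
    """Hypothetical stand-in for a connection-type failure."""

def fetch_with_retry(fetch, tweet_id, attempts=3, delay=0.0):
    """Call fetch(tweet_id), retrying only when TransientError is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(tweet_id)
        except TransientError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)

# Fake fetcher that fails once and then succeeds (for illustration only).
calls = {"n": 0}

def flaky_fetch(tweet_id):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError("connection reset")
    return {"id": tweet_id}

result = fetch_with_retry(flaky_fetch, 758740312047005698)
print(result, "after", calls["n"], "calls")
```

Permanent failures (a deleted tweet, missing permission) are deliberately not retried here, since repeating the request would not change the outcome.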

In [None]:
tweet_json = []
# read in the json file line by line into a list
with open("tweet_json.txt") as file:
    for line in file:
        tweet_json.append(json.loads(line))

In [None]:
tweet_json[19]

In [None]:
# create a data frame containing the relevant api data
api_data = pd.DataFrame({'tweet_id': [i["id_str"] for i in tweet_json], 
     'retweet_count': [i["retweet_count"] for i in tweet_json], 
     'favorite_count': [i["favorite_count"] for i in tweet_json], 
     'retweeted' : [i["retweeted"] for i in tweet_json],
     'followers_count': [i["user"]["followers_count"] for i in tweet_json], 
     'friends_count' :[i['user']['friends_count'] for i in tweet_json]               
     })
api_data.to_csv('api_data.csv')

In [None]:
api_data.head()
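The list comprehensions above assume that every tweet carries every key. As a hedged alternative, the sketch below pulls the same fields through a helper that returns `None` for missing keys. The helper name `extract_fields` and the two sample lines are invented for illustration; real tweet JSON has many more fields.

```python
import json

def extract_fields(tweet):
    """Return the fields used in api_data; missing keys become None."""
    user = tweet.get("user", {})
    return {
        "tweet_id": tweet.get("id_str"),
        "retweet_count": tweet.get("retweet_count"),
        "favorite_count": tweet.get("favorite_count"),
        "followers_count": user.get("followers_count"),
    }

# Two invented lines in the one-JSON-object-per-line layout of tweet_json.txt.
sample_lines = [
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 9, "user": {"followers_count": 100}}',
    '{"id_str": "2", "retweet_count": 0}',
]

# Skip blank lines so a trailing newline does not break json.loads.
rows = [extract_fields(json.loads(line)) for line in sample_lines if line.strip()]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.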

<a id='assessing'></a>
## 3. Assessing

I first inspected the data visually in the Jupyter notebook and in other tools (a spreadsheet), and then assessed it programmatically.

The text column of the Twitter archive holds multiple pieces of information, such as the image URL, review comments, and ratings. This information is used later to correct the rating values (rating_numerator & rating_denominator).

Some of the crawled Twitter data appears to be incorrect, likely because of limits on my permissions or subscription level.

The detailed observations are listed in <a href="#observation">3.3 Observations</a>.

### 3.1 Visual Assessment


<img src="img/weratedog_raw.png" alt="weratedog" title="WeRateDogs Twitter Archive" width="800" height="300" />
<img src="img/image_pred.png" alt="image_pred" title="Image Prediction" width="800" height="300" />
<img src="img/api-data.png" alt="api_data" title="API Data" width="700" height="300" />

In [None]:
df

In [None]:
image_pred

In [None]:
api_data

### 3.2 Programmatic assessment
Pandas functions and/or methods are used to assess the data.

In [None]:
df.info()

In [None]:
sum(df['tweet_id'].duplicated())

In [None]:
df[df['in_reply_to_status_id'].notnull()]

In [None]:
df[df['retweeted_status_id'].notnull()]

In [None]:
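#Count how often each numerator occurs, then collect the distinct values
numerator_count=df.rating_numerator.value_counts()
numerator_unique_values=numerator_count.index.unique()
numerator_unique_values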

index_list=[]
for i in numerator_unique_values:
    if i >15:
        l=df[df['rating_numerator']==i].index
        #print(str(list(l)))
        index_list.append(list(l))
print(index_list)

In [None]:
for i in index_list:
    for j in i:
        print(j,df.rating_numerator[j], df.text[j])

In [None]:
denom_count=df.rating_denominator.value_counts()
denom_count

In [None]:
unique_values=denom_count.index.unique()
unique_values

In [None]:
index_list=[]
for i in range(1,len(unique_values)):
    l=df[df['rating_denominator']==unique_values[i]].index
    index_list.append(list(l))
print(index_list)

In [None]:
for i in index_list:
    for j in i:
        print(j,df.tweet_id[j],df.rating_denominator[j], df.text[j])

In [None]:
df.loc[342]

In [None]:
image_pred.info()

In [None]:
sum(image_pred['tweet_id'].duplicated())

In [None]:
sum(image_pred['jpg_url'].duplicated())

In [None]:
#image_pred[image_pred['jpg_url'].duplicated(keep=False)==True]
pd.concat(g for _, g in image_pred.groupby("jpg_url") if len(g) > 1)

In [None]:
sum(image_pred['tweet_id'].isnull())

In [None]:
print(image_pred.query('p1_dog==True').p1.unique())
print(image_pred.query('p2_dog==True').p2.unique())
print(image_pred.query('p3_dog==True').p3.unique())

In [None]:
api_data.info()

In [None]:
api_data.friends_count.nunique()
#the friends_count column is not required

In [None]:
api_data.followers_count.value_counts()

<a id='observation'></a>
### 3.3 Observations

**WeRateDogs Twitter Archive**

- The **doggo, floofer, pupper, puppo** columns do not hold True/False values; they hold the stage name itself.
- Retweets should not be used for the analysis.
- Non-null values in **retweeted_status_id, retweeted_status_user_id, in_reply_to_status_id, in_reply_to_user_id** are stored as floats.
- Weird ratings observed in **rating_numerator & rating_denominator**: 
    + No clue to the actual rating (666/10, 182/10, 1776/10, All time 24/7, Date 11/15/15, 20/16, snoop dog 420/10, 4/20 (tweet id: 686035780142297088))
    + Only part of a decimal numerator was extracted (11.27/10, 9.75/10, 11.26/10)
    + Images with multiple dogs received aggregated ratings (44/40, 50/50, 165/150, 84/70, 88/80, 144/120, 143/130, 45/50, 99/90, 121/110, 204/170)
    + The wrong `N/M` pattern was extracted from texts that contain more than one (Current value --> Updated value)
      (Event 9/11 --> 14/10, Size 3 1/2 legged --> 9/10, 50/50 --> 11/10, 17/10 --> 13/10, 960/00 --> 13/10, 4/20 --> 13/10)
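The decimal and multiple-pattern problems above are consistent with the ratings having been taken from the first integer-only `N/M` match in the text. That is an assumption on my part, since the extraction code behind the archive is not shown here. The sketch below reproduces both effects on two invented tweet texts.

```python
import re

FIRST_INT_PATTERN = re.compile(r"(\d+)/(\d+)")          # integers only
DECIMAL_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")   # allows a decimal numerator

decimal_text = "This is Logan. 9.75/10 would pet"     # invented example text
event_text = "A 9/11 rescue dog, remembered. 14/10"   # invented example text

# The integer-only pattern cannot start at "9" because "." follows, so it starts at "75".
int_match = FIRST_INT_PATTERN.search(decimal_text).groups()
# The decimal-aware pattern captures the full numerator.
dec_match = DECIMAL_PATTERN.search(decimal_text).groups()
# With two N/M patterns in the text, the first one is not the rating.
all_matches = FIRST_INT_PATTERN.findall(event_text)

print(int_match, dec_match, all_matches)
```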

**Image_prediction**
- Duplicated image predictions (66 duplicates)
- Images with multiple dogs 

**API DATA**
- **friends_count** does not contain real data (Twitter limitation)



<a id='clean'></a>
## 4. Cleaning

Based on the observations in the previous step, I cleaned the data for quality and tidiness. I also combined all three datasets into one DataFrame so that I can do more analysis in the next step.
The steps are listed below.

 + 4.1 Twitter_Archive: Delete retweets
 + 4.2 Twitter_Archive: Drop columns that are not used
 + 4.3 Twitter_Archive: Create a dog stage column and drop doggo, floofer, pupper, puppo
 + 4.4 Twitter_Archive: Create year, month, day columns from timestamp
 + 4.5 Twitter_Archive: Correct values of rating_numerator & rating_denominator
 + 4.6 Image_Prediction: Drop duplicated image predictions based on URL
 + 4.7 Image_Prediction: Create 1 column for image prediction and 1 column for confidence level
 + 4.8 Image_Prediction: Delete columns that are not used
 + 4.9 API-DATA: Change type for tweet_id
 + 4.10 API-DATA: Drop friends_count, retweeted columns
 + 4.11 Merge dataframes
 + 4.12 Save cleaned data in "twitter_archive_master.csv"

In [None]:
# Make a copy of the tables before cleaning
df_clean = df.copy()
image_pred_clean = image_pred.copy()
api_data_clean = api_data.copy()

### 4.1 Twitter_Archive: Delete Retweets

In [None]:
#CODE: Delete retweets by filtering the NaN of retweeted_status_user_id
df_clean = df_clean[pd.isnull(df_clean['retweeted_status_user_id'])]

#TEST
print(sum(df_clean.retweeted_status_user_id.value_counts()))

### 4.2 Twitter_Archive: Drop columns that are not used

In [None]:
#get the column names of df_clean
print(df_clean.columns)

#CODE: Delete columns that are not needed
#(columns= is used because positional axis arguments to drop() were removed in pandas 2.0)
df_clean = df_clean.drop(columns=['source','in_reply_to_status_id','in_reply_to_user_id',
                                  'retweeted_status_id','retweeted_status_user_id',
                                  'retweeted_status_timestamp', 'expanded_urls'])

In [None]:
#TEST
df_clean.columns

### 4.3 Twitter_Archive: Create dogs stage columns and drop doggo,floofer,pupper, puppo

In [None]:
#CODE: Melt the doggo, floofer, pupper and puppo columns to dogs and dogs_stage column
df_clean = pd.melt(df_clean, id_vars=['tweet_id','timestamp','text','rating_numerator',
                                       'rating_denominator','name'],                                                            
                               var_name='dogs', value_name='dogs_stage')

#CODE: drop the dogs column, which only holds the old column names
df_clean = df_clean.drop(columns='dogs')

#CODE: Sort by dogs_stage then drop duplicated based on tweet_id except the last occurrence
df_clean = df_clean.sort_values('dogs_stage').drop_duplicates(subset='tweet_id', keep='last')


In [None]:
#TEST
df_clean['dogs_stage'].value_counts()
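To make the melt, sort, and de-duplicate step easier to follow, here is a minimal sketch on an invented three-tweet frame. It assumes, as the sort in 4.3 implies, that an absent stage is stored as the string `'None'`, which sorts before the lowercase stage names.

```python
import pandas as pd

# Invented data: tweet 1 has no stage, tweet 2 has one, tweet 3 has two.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo":   ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

# One row per (tweet, stage column); the old column name is not needed afterwards.
melted = pd.melt(toy, id_vars=["tweet_id"], var_name="dogs", value_name="dogs_stage")
melted = melted.drop(columns="dogs")

# 'None' sorts first, so keeping the last row per tweet keeps a real stage if one exists.
tidy = melted.sort_values("dogs_stage").drop_duplicates(subset="tweet_id", keep="last")
stage_by_tweet = tidy.set_index("tweet_id")["dogs_stage"].sort_index().to_dict()
print(stage_by_tweet)
```

One consequence worth noting: a tweet tagged with two stages (tweet 3 here) keeps only the alphabetically last one, so the second stage is silently lost.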

### 4.4 Twitter_Archive: Create Year, Month, Day columns from timestamp

In [None]:
#CODE: convert timestamp to datetime
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'])

#extract year, month and day to new columns
df_clean['year'] = df_clean['timestamp'].dt.year
df_clean['month'] = df_clean['timestamp'].dt.month
df_clean['day'] = df_clean['timestamp'].dt.day

#Finally drop the timestamp column
df_clean = df_clean.drop(columns='timestamp')

In [None]:
df_clean.head()

### 4.5 Twitter_Archive: Correct values of rating_numerator & rating_denominator

#### 4.5.1 Delete incorrect ratings for which there is no clue to the actual value
+ 666/10
+ 182/10
+ 1776/10
+ All time 24/7
+ Date 11/15/15
+ 20/16
+ snoop dog 420/10
+ 4/20 (tweet id: 686035780142297088)

In [None]:
#Code: Check number of rows before deleting rows
print("# of Rows before drop:",df_clean.shape[0])
#Code: Check index with above values
with pd.option_context('max_colwidth', 200):
    print(df_clean.text[df_clean.query('rating_numerator==666 and rating_denominator==10').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==182 and rating_denominator==10').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==1776 and rating_denominator==10').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==24 and rating_denominator==7').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==11 and rating_denominator==15').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==20 and rating_denominator==16').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==420 and rating_denominator==10').index[0]])
    print(df_clean.text[df_clean.query('rating_numerator==420 and rating_denominator==10').index[1]])
    print(df_clean.text[df_clean.query('tweet_id==686035780142297088').index[0]])

In [None]:
#Code: delete rows with weird values
df_clean.drop(df_clean.query('rating_numerator==666 and rating_denominator==10').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==182 and rating_denominator==10').index[0], inplace=True)
df_clean.drop(df_clean.query('rating_numerator==1776 and rating_denominator==10').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==24 and rating_denominator==7').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==11 and rating_denominator==15').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==20 and rating_denominator==16').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==420 and rating_denominator==10').index[0],inplace=True)
df_clean.drop(df_clean.query('rating_numerator==420 and rating_denominator==10').index[0],inplace=True)
df_clean.drop(df_clean.query('tweet_id==686035780142297088').index[0],inplace=True)


#TEST: confirm that 9 rows were deleted (one per drop call above)
df_clean.shape[0]

#### 4.5.2 Update miscaptured rating values with actual ratings

**Current value --> Updated value** <br/>
+ Event 9/11 --> 14/10
+ Size 3 1/2 legged --> 9/10
+ 50/50 --> 11/10
+ 17/10 --> 13/10
+ 960/00 --> 13/10
+ 4/20 --> 13/10
+ 7/11 --> 10/10


In [None]:
#Code: Find location for ratings above
error_list=[]
t=df_clean.query('rating_numerator==9 and rating_denominator==11').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==7 and rating_denominator==11').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==1 and rating_denominator==2').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==50 and rating_denominator==50').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==17 and rating_denominator==10').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==960 and rating_denominator==00').index[0]
error_list.append(t)
t=df_clean.query('rating_numerator==4 and rating_denominator==20').index[0]
error_list.append(t)
print(error_list)

In [None]:
#Code: find the proper value (the second N/M pair matched by the regular expression) in the text
#and update numerator and denominator
#error_list=[3065,2154,3199,52,2438,3594]
import re
regex = re.compile(r"(\d+\/\d+)")

for i in error_list:
    t=df_clean.text[i]
    #the second match holds the actual rating
    numerator,denominator=regex.findall(t)[1].split('/')
    print(i,t,numerator,denominator)
    df_clean.loc[i,'rating_numerator']=int(numerator)
    df_clean.loc[i,'rating_denominator']=int(denominator)

#TEST: check replaced values
with pd.option_context('max_colwidth', 200):
    display(df_clean[df_clean.index.isin(error_list)][['text','rating_numerator','rating_denominator']])
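Indexing the second match with `[1]` raises an `IndexError` when a text holds only one `N/M` pair. A defensive variant could look like the sketch below. The helper name `nth_rating` and both sample texts are invented for illustration.

```python
import re

RATING_PATTERN = re.compile(r"(\d+)/(\d+)")

def nth_rating(text, n=1):
    """Return the n-th (0-based) N/M pair in text as ints, or None if absent."""
    matches = RATING_PATTERN.findall(text)
    if len(matches) <= n:
        return None
    numerator, denominator = matches[n]
    return int(numerator), int(denominator)

second = nth_rating("Happy 4/20 from the squad! 13/10 for all")  # invented text
missing = nth_rating("Only one rating here: 12/10")              # invented text
print(second, missing)
```

Returning `None` lets the caller skip the row and review it by hand instead of stopping the whole loop.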


#### 4.5.3 Change rating columns to float type and correct numerator values with a decimal point

In [None]:
#Code: change int type to float
df_clean[['rating_numerator', 'rating_denominator']] = df_clean[['rating_numerator','rating_denominator']].astype(float)

# Test: check types of ratings
df_clean.info()

In [None]:
##Check & update the numerator for ratings with a decimal point
import re
decimal_pattern = r"\d+\.\d*\/\d+"   #no capture group, so str.contains raises no warning
regex = re.compile(decimal_pattern)

index_list=df_clean[df_clean['text'].str.contains(decimal_pattern)].index
for i in index_list:
    t=df_clean.text[i]
    #take the part before the slash as the full decimal numerator
    numerator=regex.findall(t)[0].split('/')[0]
    print(i,t,numerator)
    df_clean.loc[i,'rating_numerator']=float(numerator)


#TEST
with pd.option_context('max_colwidth', 200):
    display(df_clean[df_clean['text'].str.contains(decimal_pattern)]
            [['tweet_id', 'text', 'rating_numerator', 'rating_denominator']])

#### 4.5.4 Change aggregated ratings for multiple dogs in a single image

Look for values of 44/40,50/50, 165/150, 84/70,88/80, 144/120,143/130,45/50,99/90, 121/110, 204/170

In [None]:
denom_count=df_clean.rating_denominator.value_counts()
denom_count

In [None]:
df_temp=df_clean[df_clean.rating_denominator!=10.0][['rating_numerator','rating_denominator']]
index_list=list(df_temp.index)
#print(index_list)
df_temp['num_dogs']=df_temp.rating_denominator/10
df_temp['new_rating_numerator']=df_temp.rating_numerator/df_temp.num_dogs
df_temp['new_rating_numerator']=df_temp['new_rating_numerator'].astype(float)

df_clean.loc[(df_clean.rating_denominator!=10.0), 'rating_numerator']=df_temp.new_rating_numerator
df_clean.loc[(df_clean.rating_denominator!=10.0), 'rating_denominator']=10.0

with pd.option_context('max_colwidth', 200):
    display(df_clean[df_clean.index.isin(index_list)==True][['text','rating_numerator','rating_denominator']])


In [None]:
df_clean.head()
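The rescaling in 4.5.4 treats the denominator as ten points per dog. The arithmetic can be sketched as a plain function. The name `per_dog_rating` and the multiple-of-10 guard are my additions and are not part of the cleaning code above.

```python
def per_dog_rating(numerator, denominator):
    """Rescale a group rating to a single-dog rating out of 10."""
    if denominator <= 0 or denominator % 10 != 0:
        # Added guard: the "10 points per dog" reading only makes sense here.
        raise ValueError("expected a positive multiple of 10")
    num_dogs = denominator / 10
    return numerator / num_dogs, 10.0

# Three of the aggregated ratings listed in 4.5.4.
examples = {pair: per_dog_rating(*pair) for pair in [(84, 70), (165, 150), (45, 50)]}
print(examples)
```

For example, 84/70 is read as seven dogs sharing 84 points, which gives 12/10 per dog.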

### 4.6 Image_Prediction: Drop duplicated Image prediction based on url

In [None]:
#CODE: Delete duplicated jpg_url
image_pred_clean = image_pred_clean.drop_duplicates(subset=['jpg_url'], keep='last')

#TEST
sum(image_pred_clean['jpg_url'].duplicated())

### 4.7 Image_prediction: Create 1 column for image prediction and 1 column for confidence level

Create a function that keeps the first true prediction, along with its confidence level, as new columns.

In [None]:
#CODE: the first true prediction (p1, p2 or p3) will be stored in these lists
dog_type = []
confidence_list = []

#create a function with chained if/elif branches to capture the dog type and confidence level
# from the first 'true' prediction
def image(image_pred_clean):
    if image_pred_clean['p1_dog'] == True:
        dog_type.append(image_pred_clean['p1'])
        confidence_list.append(image_pred_clean['p1_conf'])
    elif image_pred_clean['p2_dog'] == True:
        dog_type.append(image_pred_clean['p2'])
        confidence_list.append(image_pred_clean['p2_conf'])
    elif image_pred_clean['p3_dog'] == True:
        dog_type.append(image_pred_clean['p3'])
        confidence_list.append(image_pred_clean['p3_conf'])
    else:
        dog_type.append('None')
        confidence_list.append('None')

#apply the function to every row of image_pred_clean (axis=1)
image_pred_clean.apply(image, axis=1)

#create new columns
image_pred_clean['dog_type'] = dog_type
image_pred_clean['confidence_list'] = confidence_list


In [None]:
#drop rows where none of the three predictions was a dog (dog_type 'None')
image_pred_clean = image_pred_clean[image_pred_clean['dog_type'] != 'None']

#TEST: 
image_pred_clean.info()
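The selection rule in 4.7 can also be written as a function that returns its result instead of appending to global lists. The sketch below is one possible alternative and is not the code used above. The function name and the two sample rows are invented, and only the column names follow `image-predictions.tsv`.

```python
def first_dog_prediction(row):
    """Return (breed, confidence) for the first pN_dog that is True, else (None, None)."""
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return row[f"p{n}"], row[f"p{n}_conf"]
    return None, None

# Invented rows; plain dicts are enough because only key lookup is used.
rows = [
    {"p1": "paper_towel", "p1_conf": 0.17, "p1_dog": False,
     "p2": "Labrador_retriever", "p2_conf": 0.16, "p2_dog": True,
     "p3": "spatula", "p3_conf": 0.04, "p3_dog": False},
    {"p1": "orange", "p1_conf": 0.09, "p1_dog": False,
     "p2": "bagel", "p2_conf": 0.08, "p2_dog": False,
     "p3": "banana", "p3_conf": 0.07, "p3_dog": False},
]
results = [first_dog_prediction(r) for r in rows]
print(results)
```

Because the function only indexes by column name, it should also work with `DataFrame.apply(..., axis=1, result_type='expand')` to produce both new columns in one pass. Returning `None` rather than the string `'None'` keeps the confidence column numeric.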

### 4.8 Image_prediction: Delete columns that are not used

In [None]:
#CODE: print list of image_prediction columns
print(list(image_pred_clean))

#Delete columns (columns= replaces the positional axis argument removed in pandas 2.0)
image_pred_clean = image_pred_clean.drop(columns=['img_num', 'p1',
                                                  'p1_conf', 'p1_dog',
                                                  'p2', 'p2_conf',
                                                  'p2_dog', 'p3',
                                                  'p3_conf',
                                                  'p3_dog'])

#TEST
image_pred_clean.head()

In [None]:
image_pred_clean.head()

### 4.9 API-DATA: Change type for tweet_id

In [None]:
#CODE: change tweet_id from str to a 64-bit int (tweet IDs do not fit in 32 bits)
api_data_clean['tweet_id'] = api_data_clean['tweet_id'].astype('int64')

#TEST
api_data_clean['tweet_id'].dtypes
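The choice of integer width matters for tweet IDs. The short check below uses one of the IDs retried in section 2.3. It illustrates the size constraints in plain Python and is not part of the cleaning code.

```python
tweet_id = 758740312047005698   # one of the IDs retried in section 2.3

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

fits_int32 = tweet_id <= INT32_MAX
fits_int64 = tweet_id <= INT64_MAX
# A 64-bit float has a 53-bit significand, so large IDs can be rounded.
survives_float = int(float(tweet_id)) == tweet_id

print(fits_int32, fits_int64, survives_float)
```

This is also why ID columns stored as floats, as noted in the observations for the reply and retweet ID columns, cannot be trusted to match the original IDs exactly.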

### 4.10 API-DATA: Drop Friends_count, retweeted column

In [None]:
#CODE: Delete retweeted, friends_count column
print(api_data_clean.columns)
api_data_clean=api_data_clean.drop(columns=['retweeted','friends_count'])
#TEST
print(api_data_clean.columns)

### 4.11 Merge dataframes

In [None]:
#CODE: create a new dataframe that merge df_clean and image_pred
dfs = pd.merge(df_clean, 
                      image_pred_clean, 
                      how = 'left', on = ['tweet_id'])

#keep rows that have picture (jpg_url)
dfs = dfs[dfs['jpg_url'].notnull()]

#TEST
dfs.info()

In [None]:
#CODE: create a new dataframe that merge dfs and api_data
df_twitter = pd.merge(dfs, api_data_clean, 
                      how = 'left', on = ['tweet_id'])

#TEST
df_twitter.info()

In [None]:
df_twitter.head()

In [None]:
df_twitter['rating_numerator'].value_counts()
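The first merge in 4.11 uses a left join and then keeps only rows with a `jpg_url`. On the invented frames below, this selects the same tweets as an inner join, assuming that `jpg_url` is never null in the image table itself.

```python
import pandas as pd

# Invented stand-ins for df_clean and image_pred_clean.
ratings = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12.0, 10.0, 13.0]})
images = pd.DataFrame({"tweet_id": [1, 3], "jpg_url": ["a.jpg", "c.jpg"]})

# Approach used in 4.11: left merge, then drop tweets without a picture.
left_filtered = pd.merge(ratings, images, how="left", on=["tweet_id"])
left_filtered = left_filtered[left_filtered["jpg_url"].notnull()]

# Alternative: an inner merge keeps only tweet IDs present in both frames.
inner = pd.merge(ratings, images, how="inner", on=["tweet_id"])

left_ids = list(left_filtered["tweet_id"])
inner_ids = list(inner["tweet_id"])
print(left_ids, inner_ids)
```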

### 4.12 Save cleaned data in "twitter_archive_master.csv"

In [None]:
#Store the clean DataFrame in a CSV file
df_twitter.to_csv('twitter_archive_master.csv', 
                 index=False, encoding = 'utf-8')