# DATA WRANGLING

PROJECT 4

MAY 9- 2020

#### NIMMY GEORGE

Data Analyst Nanodegree


#### Introduction
In this project, I will be able to document the wrangling efforts during a Jupyter Notebook, and showcase them through analyses and visualizations using Python (and its libraries)

The dataset will be used for wrangling (and analyzing and visualizing) is that the tweet archive of Twitter user @dog_rates, also referred to as WeRateDogs. WeRateDogs may be a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings nearly always have a denominator of 10. The numerators, though? nearly always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

### The Data

#### Enhanced Twitter Archive
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to form this Twitter archive "enhanced.".We manually downloaded this file manually by clicking the subsequent link: twitter_archive_enhanced.csv which can be downloaded by the udacity portal

#### Additional Data via the Twitter API
Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this extra data are often gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most up-to-date tweets, at least. But we, because we've the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? We're getting to query Twitter's API to collect this valuable data.

#### Image Predictions File
The tweet image predictions about breed of dog is present in each tweet consistent with neural network. The file (image-predictions-3.tsv) hosted on Udacity's servers and downloaded it programmatically using python Requests library on the subsequent (URL of the file: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)



#### Key Points
Key points to stay in mind when data wrangling for this project:

1.We only want original ratings (no retweets) that have images. Though there are 5000+ tweets within the dataset, not all are dog ratings and a few are retweets.So we want to get the required ratings.

2.Fully assessing and cleaning the whole dataset requires exceptional effort so only a subset of its issues (eight (8) quality issues and two (2) tidiness issues at minimum) got to be assessed and cleaned.

3.Cleaning includes merging individual pieces of knowledge consistent with the principles of tidy data.

4.The fact that the rating numerators are greater than the denominators doesn't got to be cleaned. This unique scoring system may be a big a part of the recognition of WeRateDogs.

5.We don't got to gather the tweets beyond August 1st, 2017. We can, but note that we cannot be ready to gather the image predictions for these tweets since we do not have access to the algorithm used.

#### Project Details
Fully assessing and cleaning the whole dataset would require exceptional effort so only a subset of its issues (eight quality issues and two tidiness issues at minimum) needed to be assessed and cleaned.

The tasks for this project were:

Data wrangling, which consists of:
    1.Gathering data
    2.Assessing data
    3.Cleaning data
    4.Storing data
    5.Analysing data
    6.Visualizing data
Reporting on 1)Data wrangling efforts
             2)Data analysis and visualizations

In [1]:
import pandas
import numpy 
import requests
import tweepy
import os
import json
import time
import re
import matplotlib.pyplot as plt
import warnings


Gathering of Data

In [2]:
twitter = pandas.read_csv('twitter-archive-enhanced.csv')
twitter.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
twitter.info()##to get the required info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [4]:
## getting the needed images
url="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)
image_predictions = pandas.read_csv('image_predictions.tsv', sep='\t')

In [5]:
##getting the predictions of image
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


##### Querying the Twitter API for each tweet's JSON data using Tweepy library in python library and store entire set of JSON data of tweet in a file.

In [85]:
##Required query
CONSUMER_KEY = "2ACERKpTJGGpkizCCnHMwIaTY"## these are removed due to security reasons
CONSUMER_SECRET = "YGE7O5WXw5Yi1cE6CLCXkeHBsSVcqwkhfwcItL5XobsIfmHdjf"## these are removed due to security reasons
OAUTH_TOKEN = "1051779522314493953-7P7z3KB7xTr2i2L7VOgPvlQ02W5IsD"## these are removed due to security reasons
OAUTH_TOKEN_SECRET = "70mYSglPxATdJR774A7AkMhPaK9VPZoZVWVfmXUHsUXQn" ## these are removed due to security reasons

In [86]:
##Getting the required keys and token
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

In [87]:
# gives the error tweets list
error_list = []
# gives the tweets list
df_list = []
# give execution time
start = time.time()

# for loop used to add json values to list
for tweet_id in twitter['tweet_id']:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended',
                               wait_on_rate_limit = True, wait_on_rate_limit_notify = True)._json 
        favorites = tweet['favorite_count'] # no of favourites
        retweets = tweet['retweet_count'] # no of  Count of retweet
        user_followers = tweet['user']['followers_count'] # no of followers of the user
        user_favourites = tweet['user']['favourites_count'] # no of favorites of the user
        date_time = tweet['created_at'] # creation time and date
        
        df_list.append({'tweet_id': int(tweet_id),
                        'favorites': int(favorites),
                        'retweets': int(retweets),
                        'user_followers': int(user_followers),
                        'user_favourites': int(user_favourites),
                        'date_time': pandas.to_datetime(date_time)})
    except Exception as e:
        print(str(tweet_id)+ " _ " + str(e))
        error_list.append(tweet_id)
# gives the execution time
end = time.time()
print(end - start)

888202515573088257 _ [{'code': 144, 'message': 'No status found with that ID.'}]
873697596434513921 _ [{'code': 144, 'message': 'No status found with that ID.'}]
872668790621863937 _ [{'code': 144, 'message': 'No status found with that ID.'}]
872261713294495745 _ [{'code': 144, 'message': 'No status found with that ID.'}]
869988702071779329 _ [{'code': 144, 'message': 'No status found with that ID.'}]
866816280283807744 _ [{'code': 144, 'message': 'No status found with that ID.'}]
861769973181624320 _ [{'code': 144, 'message': 'No status found with that ID.'}]
856602993587888130 _ [{'code': 144, 'message': 'No status found with that ID.'}]
851953902622658560 _ [{'code': 144, 'message': 'No status found with that ID.'}]
845459076796616705 _ [{'code': 144, 'message': 'No status found with that ID.'}]
844704788403113984 _ [{'code': 144, 'message': 'No status found with that ID.'}]
842892208864923648 _ [{'code': 144, 'message': 'No status found with that ID.'}]
837366284874571778 _ [{'code

KeyboardInterrupt: 

In [88]:
# lengh of result using len function
print("The lengh of the result", len(df_list))
# The length of the errors using len function
print("The lengh of the errors", len(error_list))

The lengh of the result 564
The lengh of the errors 18


From the above results:

We reached the limit of the tweepy API three times but wait_on_rate_limit automatically wait for rate limits to re-establish and wait_on_rate_limit_notify print a notification when Tweepy is waiting.
We could length as 564 tweet_id correctly with 18 errors


In [89]:
json_tweets = pandas.DataFrame(df_list, columns = ['tweet_id', 'favorites', 'retweets',
                                               'user_followers', 'user_favourites', 'date_time'])
# Saving the dataFrame in file
json_tweets.to_csv('tweet_json.txt', encoding = 'utf-8', index=False)


In [90]:
tweet_data = pandas.read_csv('tweet_json.txt', encoding = 'utf-8')
tweet_data

Unnamed: 0,tweet_id,favorites,retweets,user_followers,user_favourites,date_time
0,892420643555336193,36297,7725,8773685,146003,2017-08-01 16:23:56+00:00
1,892177421306343426,31293,5718,8774391,146004,2017-08-01 00:17:27+00:00
2,891815181378084864,23569,3786,8773686,146003,2017-07-31 00:18:03+00:00
3,891689557279858688,39583,7880,8774392,146004,2017-07-30 15:58:51+00:00
4,891327558926688256,37789,8506,8774392,146004,2017-07-29 16:00:24+00:00
...,...,...,...,...,...,...
559,800855607700029440,0,1655,8773692,146003,2016-11-22 00:17:10+00:00
560,800751577355128832,10798,2867,8773692,146003,2016-11-21 17:23:47+00:00
561,800513324630806528,13432,3066,8774398,146004,2016-11-21 01:37:04+00:00
562,800459316964663297,9721,2229,8773692,146003,2016-11-20 22:02:27+00:00


In [91]:
## getting the data info
tweet_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564 entries, 0 to 563
Data columns (total 6 columns):
tweet_id           564 non-null int64
favorites          564 non-null int64
retweets           564 non-null int64
user_followers     564 non-null int64
user_favourites    564 non-null int64
date_time          564 non-null object
dtypes: int64(5), object(1)
memory usage: 26.6+ KB


### Gather: Summary
Gathering is the first step in the data wrangling process.

Obtaining data
Getting data from an existing file (twitter-archive-enhanced.csv) Reading from csv file using pandas
Downloading a file from the internet (image-predictions.tsv) Downloading file using requests
Querying an API (tweet_json.txt) Get JSON object of all the tweet_ids using Tweepy
Importing that data into our programming environment (Jupyter Notebook)

### Assessing

In [92]:
twitter.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1058,741793263812808706,,,2016-06-12 00:44:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When your crush won't pay attention to you. Bo...,,,,https://twitter.com/dog_rates/status/741793263...,10,10,,,,,
958,751456908746354688,,,2016-07-08 16:44:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a pupper that's very hungry but too laz...,,,,https://twitter.com/dog_rates/status/751456908...,12,10,,,,pupper,
244,846042936437604353,,,2017-03-26 16:55:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jarvis. The snow pupsets him. Officially ...,,,,https://twitter.com/dog_rates/status/846042936...,12,10,Jarvis,,,,
1504,691820333922455552,,,2016-01-26 03:09:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Brockly. He's an uber driver. Falls as...,,,,https://twitter.com/dog_rates/status/691820333...,8,10,Brockly,,,,
1612,685321586178670592,,,2016-01-08 04:46:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rocky. He sleeps like a psychopath. 10...,,,,https://twitter.com/dog_rates/status/685321586...,10,10,Rocky,,,,
520,810254108431155201,,,2016-12-17 22:43:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Gus. He likes to be close to you, whic...",,,,https://twitter.com/dog_rates/status/810254108...,12,10,Gus,,,,
1621,684926975086034944,,,2016-01-07 02:38:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Bruiser &amp; Charlie. They are the best ...,,,,https://twitter.com/dog_rates/status/684926975...,11,10,Bruiser,,,,
729,781955203444699136,,,2016-09-30 20:33:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Chipson. He weighed in at .3 ounces an...,,,,https://twitter.com/dog_rates/status/781955203...,11,10,Chipson,,,,
433,820690176645140481,,,2017-01-15 17:52:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,,https://twitter.com/dog_rates/status/820690176...,84,70,,,,,
1585,686947101016735744,,,2016-01-12 16:25:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jackson. He was specifically told not ...,,,,https://twitter.com/dog_rates/status/686947101...,11,10,Jackson,,,,


In [93]:
# Assessing the data programmaticaly
twitter.info()
twitter.describe()
twitter['rating_numerator'].value_counts()
twitter['rating_denominator'].value_counts()
twitter['name'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

None       745
a           55
Charlie     12
Cooper      11
Lucy        11
          ... 
Anthony      1
Gordon       1
Maya         1
Alf          1
Bradley      1
Name: name, Length: 957, dtype: int64

In [94]:
# View descriptive statistics of twitter
twitter.describe()


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [95]:
image_predictions##getting the image predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [96]:
image_predictions.info()
image_predictions['jpg_url'].value_counts()
image_predictions[image_predictions['jpg_url'] == 'https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True


In [97]:
twitter.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [98]:
#This is used to get the rating that don't follow pattern
twitter[twitter['rating_numerator'] > 20]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
188,855862651834028034,8.558616e+17,194351800.0,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@dhmontgomery We also gave snoop dogg a 420/10...,,,,,420,10,,,,,
189,855860136149123072,8.558585e+17,13615720.0,2017-04-22 19:05:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@s8n You tried very hard to portray this good ...,,,,,666,10,,,,,
290,838150277551247360,8.381455e+17,21955060.0,2017-03-04 22:12:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@markhoppus 182/10,,,,,182,10,,,,,
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,
340,832215909146226688,,,2017-02-16 13:11:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4196984000.0,2016-10-13 23:23:56 +0000,https://twitter.com/dog_rates/status/786709082...,75,10,Logan,,,,
433,820690176645140481,,,2017-01-15 17:52:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,,https://twitter.com/dog_rates/status/820690176...,84,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,
695,786709082849828864,,,2016-10-13 23:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,,https://twitter.com/dog_rates/status/786709082...,75,10,Logan,,,,
763,778027034220126208,,,2016-09-20 00:24:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,,https://twitter.com/dog_rates/status/778027034...,27,10,Sophie,,,pupper,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Why does this never happen at my front door......,,,,https://twitter.com/dog_rates/status/758467244...,165,150,,,,,


In [99]:
##This is used to get names that are unusual
twitter[twitter['name'].apply(len) < 3]


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
56,881536004380872706,,,2017-07-02 15:32:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a pupper approaching maximum borkdrive...,,,,https://twitter.com/dog_rates/status/881536004...,14,10,a,,,pupper,
393,825876512159186944,,,2017-01-30 01:21:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Mo. No one will push him around in the...,,,,https://twitter.com/dog_rates/status/825876512...,11,10,Mo,,,,
446,819015337530290176,,,2017-01-11 02:57:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Bo. He was a very good ...,8.190048e+17,4.196984e+09,2017-01-11 02:15:36 +0000,https://twitter.com/dog_rates/status/819004803...,14,10,Bo,doggo,,,
449,819004803107983360,,,2017-01-11 02:15:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bo. He was a very good First Doggo. 14...,,,,https://twitter.com/dog_rates/status/819004803...,14,10,Bo,doggo,,,
553,804026241225523202,,,2016-11-30 18:16:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bo. He's going to make me cry. 13/10 p...,,,,https://twitter.com/dog_rates/status/804026241...,13,10,Bo,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,666051853826850816,,,2015-11-16 00:35:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an odd dog. Hard on the outside but lo...,,,,https://twitter.com/dog_rates/status/666051853...,2,10,an,,,,
2350,666050758794694657,,,2015-11-16 00:30:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a truly beautiful English Wilson Staff...,,,,https://twitter.com/dog_rates/status/666050758...,10,10,a,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,


In [100]:
## To get the tweets that are original.
twitter[twitter['retweeted_status_id'].isnull()]


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


#### Quality
Completeness, Validity, Accuracy, Consistency

twitter dataset

in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be integers/strings rather than float.
retweeted_status_timestamp, timestamp should be datetime rather than object (string).
The numerator and denominator columns have invalid values.
In several columns null objects are non-null (None to NaN).
Name column have invalid names i.e 'None', 'a', 'an' and fewer than 3 characters.
We only want original ratings (no retweets) that have images.
We might want to vary this columns type (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and tweet_id) to string because we do not want any operations on them.
Sources difficult to read.
image_predictions dataset

Missing values from images dataset (2075 rows rather than 2356)
Some tweet_ids have an equivalent jpg_url
Some tweets are have 2 different tweet_id one redirect to the opposite (Dataset contains retweets)
tweet_data dataset

This tweet_id (666020888022790149) duplicated 8 times



### Tidiness
Untidy data 



### Cleaning
Here cleaning of data is done.

In [101]:
## Here ccopying is done
tweet_data_clean = tweet_data.copy()
twitter_clean = twitter.copy()
image_predictions_clean= image_predictions.copy()

#### Define
Adding image_predictions and tweet_info to twitter table.



#### code

In [102]:
## stages of cleaning is done
twitter_clean = pandas.merge(left=twitter_clean,
                                 right=tweet_data_clean, left_on='tweet_id', right_on='tweet_id', how='inner')

In [103]:
twitter_clean = twitter_clean.merge(image_predictions_clean, on='tweet_id', how='inner')

#### Test

In [104]:
##getting the info of cleaned data
twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 459 entries, 0 to 458
Data columns (total 33 columns):
tweet_id                      459 non-null int64
in_reply_to_status_id         4 non-null float64
in_reply_to_user_id           4 non-null float64
timestamp                     459 non-null object
source                        459 non-null object
text                          459 non-null object
retweeted_status_id           34 non-null float64
retweeted_status_user_id      34 non-null float64
retweeted_status_timestamp    34 non-null object
expanded_urls                 459 non-null object
rating_numerator              459 non-null int64
rating_denominator            459 non-null int64
name                          459 non-null object
doggo                         459 non-null object
floofer                       459 non-null object
pupper                        459 non-null object
puppo                         459 non-null object
favorites                     459 non-null int64
re

#### Define
Melt the 'doggo', 'floofer', 'pupper' and 'puppo' columns into one column 'dog_stage'.



#### code

In [105]:
# getting melt cols to remain
MELTS_COLUMNS = ['doggo', 'floofer', 'pupper', 'puppo']
STAY_COLUMNS = [x for x in twitter_clean.columns.tolist() if x not in MELTS_COLUMNS]


# values are assigned to melt cols
twitter_clean = pandas.melt(twitter_clean, id_vars = STAY_COLUMNS, value_vars = MELTS_COLUMNS, 
                         var_name = 'stages', value_name = 'dog_stage')
# stages cols are deleted
twitter_clean = twitter_clean.drop('stages', 1)

#### Test

In [106]:
print(twitter_clean.dog_stage.value_counts())
print(len(twitter_clean))##To get the length

None       1745
doggo        40
pupper       35
puppo        14
floofer       2
Name: dog_stage, dtype: int64
1836


#### Clean
removing unusual rows and cols

#### code

In [107]:
# here the retweets are deleted
twitter_clean = twitter_clean[pandas.isnull(twitter_clean.retweeted_status_id)]

# the duplicated tweet_id is deleted
twitter_clean = twitter_clean.drop_duplicates()

# tweets which donot have any pictures are deleted
twitter_clean = twitter_clean.dropna(subset = ['jpg_url'])

# getting length
len(twitter_clean)

507

In [108]:
# cols that are related to tweets are deleted
twitter_clean = twitter_clean.drop('retweeted_status_id', 1)
twitter_clean = twitter_clean.drop('retweeted_status_user_id', 1)
twitter_clean = twitter_clean.drop('retweeted_status_timestamp', 1)

# col date_time is deleted
twitter_clean = twitter_clean.drop('date_time', 1)

# getting list
list(twitter_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'favorites',
 'retweets',
 'user_followers',
 'user_favourites',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'dog_stage']

In [109]:
#dog_stage cols are deleted
twitter_clean = twitter_clean.sort_values('dog_stage').drop_duplicates('tweet_id', keep = 'last')


#### Test

In [110]:
print(twitter_clean.dog_stage.value_counts())
print(len(twitter_clean))##getting length

None       349
doggo       31
pupper      30
puppo       13
floofer      2
Name: dog_stage, dtype: int64
425


#### Define
 image prediction columns are removed

#### code

In [111]:
# storing first algorithm
prediction_algorithm = []
confidence_level = []


##prediction_confidence function is defined
def get_prediction_confidence(dataframe):
    if dataframe['p1_dog'] == True:
        prediction_algorithm.append(dataframe['p1'])
        confidence_level.append(dataframe['p1_conf'])
    elif dataframe['p2_dog'] == True:
        prediction_algorithm.append(dataframe['p2'])
        confidence_level.append(dataframe['p2_conf'])
    elif dataframe['p3_dog'] == True:
        prediction_algorithm.append(dataframe['p3'])
        confidence_level.append(dataframe['p3_conf'])
    else:
        prediction_algorithm.append('NaN')
        confidence_level.append(0)

twitter_clean.apply(get_prediction_confidence, axis=1)
twitter_clean['prediction_algorithm'] = prediction_algorithm ## getting prediction algorithm
twitter_clean['confidence_level'] = confidence_level ## getting confidence level

#### Test

In [112]:
list(twitter_clean) ## getting the list

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'favorites',
 'retweets',
 'user_followers',
 'user_favourites',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'dog_stage',
 'prediction_algorithm',
 'confidence_level']

In [113]:
# image prediction cols are deleted
twitter_clean = twitter_clean.drop(['img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'], 1)
list(twitter_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'favorites',
 'retweets',
 'user_followers',
 'user_favourites',
 'jpg_url',
 'dog_stage',
 'prediction_algorithm',
 'confidence_level']

In [114]:
## giving concentration on values that are low
twitter_clean.info()
print('in_reply_to_user_id ')
print(twitter_clean['in_reply_to_user_id'].value_counts())
print('source ')
print(twitter_clean['source'].value_counts())
print('user_favourites ')
print(twitter_clean['user_favourites'].value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 425 entries, 0 to 1826
Data columns (total 18 columns):
tweet_id                 425 non-null int64
in_reply_to_status_id    4 non-null float64
in_reply_to_user_id      4 non-null float64
timestamp                425 non-null object
source                   425 non-null object
text                     425 non-null object
expanded_urls            425 non-null object
rating_numerator         425 non-null int64
rating_denominator       425 non-null int64
name                     425 non-null object
favorites                425 non-null int64
retweets                 425 non-null int64
user_followers           425 non-null int64
user_favourites          425 non-null int64
jpg_url                  425 non-null object
dog_stage                425 non-null object
prediction_algorithm     425 non-null object
confidence_level         425 non-null float64
dtypes: float64(3), int64(7), object(8)
memory usage: 63.1+ KB
in_reply_to_user_id 
4.196984

#### Notes
One value is in reply to user id, so the cols are deleted


In [115]:
# Here certain cols are dropped
twitter_clean = twitter_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'user_favourites'], 1)
# Cleaning
twitter_clean['source'] = twitter_clean['source'].apply(lambda x: re.findall(r'>(.*)<', x)[0])
twitter_clean


Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,favorites,retweets,user_followers,jpg_url,dog_stage,prediction_algorithm,confidence_level
0,892420643555336193,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,36297,7725,8773685,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,,,0.000000
327,821886076407029760,2017-01-19 01:04:45 +0000,Twitter for iPhone,This is Jimison. He was just called a good boy...,https://twitter.com/dog_rates/status/821886076...,13,10,Jimison,11487,2352,8773692,https://pbs.twimg.com/media/C2ftAxnWIAEUdAR.jpg,,golden_retriever,0.266238
326,822244816520155136,2017-01-20 00:50:15 +0000,Twitter for iPhone,We only rate dogs. Please don't send pics of m...,https://twitter.com/dog_rates/status/822244816...,11,10,,35463,10048,8773692,https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg,,Samoyed,0.585441
324,822489057087389700,2017-01-20 17:00:46 +0000,Twitter for iPhone,This is Paisley. She really wanted to be presi...,https://twitter.com/dog_rates/status/822489057...,13,10,Paisley,18411,6515,8773692,https://pbs.twimg.com/media/C2oRbOuWEAAbVSl.jpg,,Samoyed,0.416769
323,822610361945911296,2017-01-21 01:02:48 +0000,Twitter for iPhone,Please stop sending in non-canines like this V...,https://twitter.com/dog_rates/status/822610361...,12,10,,15036,3007,8773692,https://pbs.twimg.com/media/C2p_wQyXEAELtvS.jpg,,cocker_spaniel,0.664487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1440,878776093423087618,2017-06-25 00:45:22 +0000,Twitter for iPhone,This is Snoopy. He's a proud #PrideMonthPuppo....,https://twitter.com/dog_rates/status/878776093...,13,10,Snoopy,18220,3748,8773688,https://pbs.twimg.com/media/DDIKMXzW0AEibje.jpg,puppo,Italian_greyhound,0.734684
1391,889531135344209921,2017-07-24 17:02:04 +0000,Twitter for iPhone,This is Stuart. He's sporting his favorite fan...,https://twitter.com/dog_rates/status/889531135...,13,10,Stuart,14212,2061,8774391,https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg,puppo,golden_retriever,0.953442
1389,889665388333682689,2017-07-25 01:55:32 +0000,Twitter for iPhone,Here's a puppo that seems to be on the fence a...,https://twitter.com/dog_rates/status/889665388...,13,10,,45134,9152,8774391,https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg,puppo,Pembroke,0.966327
1514,859607811541651456,2017-05-03 03:17:27 +0000,Twitter for iPhone,Sorry for the lack of posts today. I came home...,https://twitter.com/dog_rates/status/859607811...,13,10,,17960,1488,8773689,https://pbs.twimg.com/media/C-3wvtxXcAUTuBE.jpg,puppo,golden_retriever,0.895529


#### Define
Fix rating numerator and denominators that are not actually ratings

#### code

In [116]:
# To get the occurences of #
text_ratings_to_fix = twitter_clean[twitter_clean.text.str.contains( r"(\d+\.?\d*\/\d+\.?\d*\D+\d+\.?\d*\/\d+\.?\d*)")].text

text_ratings_to_fix

Series([], Name: text, dtype: object)

In [117]:
for entry in text_ratings_to_fix:##for loop
    mask = twitter_archive_clean.text == entry
    column_name1 = 'rating_numerator'
    column_name2 = 'rating_denominator'
    twitter_clean.loc[mask, column_name1] = re.findall(r"\d+\.?\d*\/\d+\.?\d*\D+(\d+\.?\d*)\/\d+\.?\d*", entry)
    twitter_clean.loc[mask, column_name2] = 10

In [118]:
twitter_clean[twitter_clean.text.isin(text_ratings_to_fix)]


Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,favorites,retweets,user_followers,jpg_url,dog_stage,prediction_algorithm,confidence_level


#### Define
Fixing the numerator that have decimals.

#### code

In [119]:
# tweets with decimals are viewed
twitter_clean[twitter_clean.text.str.contains(r"(\d+\.\d*\/\d+)")]

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,favorites,retweets,user_followers,jpg_url,dog_stage,prediction_algorithm,confidence_level
40,883482846933004288,2017-07-08 00:28:19 +0000,Twitter for iPhone,This is Bella. She hopes her smile made you sm...,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,43067,9036,8774392,https://pbs.twimg.com/media/DELC9dZXUAADqUk.jpg,,golden_retriever,0.943082


In [120]:
# Setting accurate numerators
twitter_clean.loc[(twitter_clean['tweet_id'] == 883482846933004288) & (twitter_clean['rating_numerator'] == 5), ['rating_numerator']] = 13.5
twitter_clean.loc[(twitter_clean['tweet_id'] == 786709082849828864) & (twitter_clean['rating_numerator'] == 75), ['rating_numerator']] = 9.75
twitter_clean.loc[(twitter_clean['tweet_id'] == 778027034220126208) & (twitter_clean['rating_numerator'] == 27), ['rating_numerator']] = 11.27
twitter_clean.loc[(twitter_clean['tweet_id'] == 680494726643068929) & (twitter_clean['rating_numerator'] == 26), ['rating_numerator']] = 11.26

#### Test

In [121]:
twitter_clean[twitter_clean.text.str.contains(r"(\d+\.\d*\/\d+)")]

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,favorites,retweets,user_followers,jpg_url,dog_stage,prediction_algorithm,confidence_level
40,883482846933004288,2017-07-08 00:28:19 +0000,Twitter for iPhone,This is Bella. She hopes her smile made you sm...,https://twitter.com/dog_rates/status/883482846...,13.5,10,Bella,43067,9036,8774392,https://pbs.twimg.com/media/DELC9dZXUAADqUk.jpg,,golden_retriever,0.943082


#### Define
To get gender of dogs

#### code

In [122]:
# here appending the list

male = ['He', 'he', 'him']
female = ['She', 'she', 'her']

gender = []

for text in twitter_clean['text']:
    # Male
    if any(map(lambda v:v in male, text.split())):
        gender.append('male')
    # Female
    elif any(map(lambda v:v in female, text.split())):
        gender.append('female')
    
    else:
        gender.append('NaN')

# getting length
len(gender)

# Saving the result
twitter_clean['gender'] = gender

#### Test

In [123]:
print("gender count \n", twitter_clean.gender.value_counts())

gender count 
 NaN       197
male      155
female     73
Name: gender, dtype: int64


#### Define
Converting the null values to None

#### code

In [124]:
twitter_clean.loc[twitter_clean['prediction_algorithm'] == 'NaN', 'prediction_algorithm'] = None
twitter_clean.loc[twitter_clean['gender'] == 'NaN', 'gender'] = None
twitter_clean.loc[twitter_clean['rating_numerator'] == 'NaN', 'rating_numerator'] = 0

#### Test

In [125]:
twitter_clean.info()##to get info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 425 entries, 0 to 1826
Data columns (total 16 columns):
tweet_id                425 non-null int64
timestamp               425 non-null object
source                  425 non-null object
text                    425 non-null object
expanded_urls           425 non-null object
rating_numerator        425 non-null float64
rating_denominator      425 non-null int64
name                    425 non-null object
favorites               425 non-null int64
retweets                425 non-null int64
user_followers          425 non-null int64
jpg_url                 425 non-null object
dog_stage               425 non-null object
prediction_algorithm    382 non-null object
confidence_level        425 non-null float64
gender                  228 non-null object
dtypes: float64(2), int64(5), object(9)
memory usage: 56.4+ KB


#### Define
Datatypes are changed



#### code

In [126]:
##To change the data types
twitter_clean['tweet_id'] = twitter_clean['tweet_id'].astype(str)
twitter_clean['timestamp'] = pandas.to_datetime(twitter_clean.timestamp)
twitter_clean['source'] = twitter_clean['source'].astype('category')
twitter_clean['favorites'] = twitter_clean['favorites'].astype(int)
twitter_clean['retweets'] = twitter_clean['retweets'].astype(int)
twitter_clean['user_followers'] = twitter_clean['user_followers'].astype(int)
twitter_clean['dog_stage'] = twitter_clean['dog_stage'].astype('category')
twitter_clean['rating_numerator'] = twitter_clean['rating_numerator'].astype(float)
twitter_clean['rating_denominator'] = twitter_clean['rating_denominator'].astype(float)
twitter_clean['gender'] = twitter_clean['gender'].astype('category')

#### Test

In [127]:
twitter_clean.dtypes

tweet_id                             object
timestamp               datetime64[ns, UTC]
source                             category
text                                 object
expanded_urls                        object
rating_numerator                    float64
rating_denominator                  float64
name                                 object
favorites                             int32
retweets                              int32
user_followers                        int32
jpg_url                              object
dog_stage                          category
prediction_algorithm                 object
confidence_level                    float64
gender                             category
dtype: object

#### Store

In [128]:

twitter_clean.drop(twitter_clean.columns[twitter_clean.columns.str.contains('Unnamed',case = False)],axis = 1)
twitter_clean.to_csv('twitter_master.csv', encoding = 'utf-8', index=False)
twitter_clean = pandas.read_csv('twitter_master.csv')
twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 16 columns):
tweet_id                425 non-null int64
timestamp               425 non-null object
source                  425 non-null object
text                    425 non-null object
expanded_urls           425 non-null object
rating_numerator        425 non-null float64
rating_denominator      425 non-null float64
name                    425 non-null object
favorites               425 non-null int64
retweets                425 non-null int64
user_followers          425 non-null int64
jpg_url                 425 non-null object
dog_stage               425 non-null object
prediction_algorithm    382 non-null object
confidence_level        425 non-null float64
gender                  228 non-null object
dtypes: float64(3), int64(4), object(9)
memory usage: 53.2+ KB


visualizations is given in the act_report.ipynd notebook