# Wrangling and Analyzing WeRateDogs Twitter Account (DAND project)
Abdullah Almuzaini

## Introduction 


## Data Gathering

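> Gathering the pieces of data required for the project. Three datasets will be gathered and downloaded into the notebook.
> The three datasets are:
>    - The WeRateDogs Twitter account archive (`twitter-archive-enhanced.csv`), which is hosted on a Udacity server.
>    - The image predictions file (`image-predictions.tsv`), which is also downloaded programmatically.
>    - Each tweet's retweet count and favorite ("like") count, collected from the Twitter API using tweepy.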

#### Importing the necessary libraries

In [1]:
import requests
import os
import pandas as pd
import tweepy
import json
from tqdm import tqdm
pd.set_option('display.max_rows', None, 'display.max_columns',None)
import numpy as np
import re

### The cell below contains two functions:
> - new_folder() to create a new folder
> - get_files() to download files from the internet programmatically




In [2]:

def new_folder(folder_name):
    """
    Create a new directory inside the current working directory if it does not already exist.
    """
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)


def get_files(file_name, url):
    """
    Download a file from the internet using the requests library and save it
    in the 'dataset' directory under the name file_name.
    Note: file_name has to include the file extension (for example '.csv').
    """
    response = requests.get(url)
    # Stop early on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    with open(os.path.join('dataset', file_name), mode='wb') as file:
        file.write(response.content)
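
A slightly more defensive variant of the download helper might skip files that are already on disk and derive the file name from the URL when none is given. The sketch below is an illustration, not the notebook's code: `download_file` and its parameters are hypothetical names, and it uses the standard-library `urllib` in place of `requests` so that it runs without extra packages.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_file(url, folder="dataset", file_name=None, overwrite=False):
    """Download url into folder and return the local path (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    # Fall back to the last path segment of the URL when no name is given
    if file_name is None:
        file_name = os.path.basename(urlparse(url).path)
    path = os.path.join(folder, file_name)
    # Skip the download when the file is already present
    if os.path.exists(path) and not overwrite:
        return path
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

With this helper, a call such as `download_file(URL, file_name="twitter-archive-enhanced.csv")` would mirror the `get_files` calls used in this notebook.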


In [3]:
# Create a new folder if it does not exist 
new_folder('dataset')

In [4]:
URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv'

# get_files() saves the file to disk and returns nothing, so there is no result to assign
get_files(file_name="twitter-archive-enhanced.csv", url=URL)
df = pd.read_csv(os.path.join('dataset', 'twitter-archive-enhanced.csv'))
df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [5]:
# The URL to `image-predictions.tsv` dataset

URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# Download the dataset 

get_files(file_name="image-predictions.tsv", url=URL)

# Read the dataset in pandas DataFrame
img_pred = pd.read_csv('dataset/image-predictions.tsv', sep='\t')
img_pred.head()


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [6]:
df.shape

(2356, 17)

### Setting up tweepy

In [7]:
# Twitter Credentials

ACCESS_TOKEN = ""  
ACCESS_TOKEN_SECRET = ""  
CONSUMER_KEY = "" 
CONSUMER_SECRET = ""

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [7]:
# Test the api
tweet = api.get_status(677187300187611136)
print(tweet.user.screen_name)

> #### The following block of code collects three pieces of data about each tweet:
>    - tweet_id (taken from 'twitter-archive-enhanced.csv')
>    - retweet_count (collected using the tweepy API)
>    - favorite_count (collected using the tweepy API and stored as `likes_count`)

> #### These pieces of data are added to a dict so that they can be saved in JSON format


In [58]:
data = {}
data['tweet_info']=  []
for i in tqdm(df['tweet_id']):
    try:
        tweet = api.get_status(i)
        data['tweet_info'].append({
            'tweet_id' : i ,
            'retweet_count': tweet.retweet_count ,
            'likes_count' : tweet.favorite_count
        })
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("The tweet associated to the tweet id", i," was not found")


  1%|          | 21/2356 [00:04<09:14,  4.21it/s]

The tweet associated to the tweet id 888202515573088257  was not found


  4%|▍         | 97/2356 [00:18<07:07,  5.28it/s]

The tweet associated to the tweet id 873697596434513921  was not found


  4%|▍         | 103/2356 [00:19<06:31,  5.75it/s]

The tweet associated to the tweet id 872668790621863937  was not found


  4%|▍         | 106/2356 [00:19<06:12,  6.05it/s]

The tweet associated to the tweet id 872261713294495745  was not found


  5%|▌         | 120/2356 [00:22<07:00,  5.32it/s]

The tweet associated to the tweet id 869988702071779329  was not found


  6%|▌         | 134/2356 [00:24<06:46,  5.47it/s]

The tweet associated to the tweet id 866816280283807744  was not found


  7%|▋         | 157/2356 [00:28<06:01,  6.08it/s]

The tweet associated to the tweet id 861769973181624320  was not found


  8%|▊         | 183/2356 [00:35<13:02,  2.78it/s]

The tweet associated to the tweet id 856602993587888130  was not found


  8%|▊         | 187/2356 [00:36<11:39,  3.10it/s]

The tweet associated to the tweet id 856330835276025856  was not found


  9%|▉         | 213/2356 [00:41<06:08,  5.82it/s]

The tweet associated to the tweet id 851953902622658560  was not found


 11%|█         | 249/2356 [00:47<06:38,  5.28it/s]

The tweet associated to the tweet id 845459076796616705  was not found


 11%|█         | 255/2356 [00:48<05:42,  6.14it/s]

The tweet associated to the tweet id 844704788403113984  was not found


 11%|█         | 262/2356 [00:49<05:55,  5.89it/s]

The tweet associated to the tweet id 842892208864923648  was not found


 13%|█▎        | 298/2356 [00:56<05:36,  6.11it/s]

The tweet associated to the tweet id 837366284874571778  was not found


 13%|█▎        | 300/2356 [00:56<05:45,  5.94it/s]

The tweet associated to the tweet id 837012587749474308  was not found


 15%|█▌        | 365/2356 [01:08<05:28,  6.06it/s]

The tweet associated to the tweet id 829374341691346946  was not found


 16%|█▋        | 384/2356 [01:12<06:34,  5.00it/s]

The tweet associated to the tweet id 827228250799742977  was not found


 22%|██▏       | 508/2356 [01:34<05:08,  5.98it/s]

The tweet associated to the tweet id 812747805718642688  was not found


 24%|██▍       | 568/2356 [01:46<04:54,  6.08it/s]

The tweet associated to the tweet id 802247111496568832  was not found


 32%|███▏      | 752/2356 [02:19<04:31,  5.90it/s]

The tweet associated to the tweet id 779123168116150273  was not found


 33%|███▎      | 786/2356 [02:26<05:02,  5.19it/s]

The tweet associated to the tweet id 775096608509886464  was not found


 35%|███▍      | 817/2356 [02:31<04:15,  6.02it/s]

The tweet associated to the tweet id 771004394259247104  was not found


 35%|███▍      | 820/2356 [02:32<05:09,  4.96it/s]

The tweet associated to the tweet id 770743923962707968  was not found


 36%|███▌      | 843/2356 [02:36<04:20,  5.80it/s]

The tweet associated to the tweet id 766864461642756096  was not found


 38%|███▊      | 889/2356 [02:44<04:10,  5.85it/s]

The tweet associated to the tweet id 759923798737051648  was not found


 38%|███▊      | 892/2356 [02:44<04:12,  5.80it/s]

The tweet associated to the tweet id 759566828574212096  was not found


 38%|███▊      | 900/2356 [02:46<04:40,  5.19it/s]Rate limit reached. Sleeping for: 734
 40%|███▉      | 934/2356 [15:11<04:07,  5.74it/s]    

The tweet associated to the tweet id 754011816964026368  was not found


 73%|███████▎  | 1727/2356 [17:37<01:41,  6.20it/s]

The tweet associated to the tweet id 680055455951884288  was not found


 76%|███████▋  | 1800/2356 [17:50<01:44,  5.32it/s]Rate limit reached. Sleeping for: 735
100%|██████████| 2356/2356 [31:53<00:00,  1.23it/s]    
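
The loop above prints each missing id as it goes. One possible refinement, sketched here with a stand-in `fetch_status` function instead of a real `api.get_status` call, is to keep the failed ids in a list so they can be inspected or retried later. `collect_counts`, `fake_fetch` and `FakeStatus` are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Stand-in for the object returned by the API (assumption for this sketch)
FakeStatus = namedtuple("FakeStatus", ["retweet_count", "favorite_count"])


def collect_counts(tweet_ids, fetch_status):
    """Return (records, failed_ids) for the given tweet ids."""
    records, failed_ids = [], []
    for tweet_id in tweet_ids:
        try:
            status = fetch_status(tweet_id)
        except Exception:
            # Any lookup failure (for example a deleted tweet) ends up here
            failed_ids.append(tweet_id)
            continue
        records.append({
            "tweet_id": tweet_id,
            "retweet_count": status.retweet_count,
            "likes_count": status.favorite_count,
        })
    return records, failed_ids


def fake_fetch(tweet_id):
    """Pretend lookup: only one id is known."""
    known = {892420643555336193: FakeStatus(7189, 34531)}
    if tweet_id not in known:
        raise LookupError(f"tweet {tweet_id} not found")
    return known[tweet_id]


records, failed_ids = collect_counts(
    [892420643555336193, 888202515573088257], fake_fetch
)
print(len(records), "collected;", len(failed_ids), "failed")
```

In the notebook, `fetch_status` would be replaced by `api.get_status`, and `failed_ids` could be saved next to the JSON file.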


In [59]:
# Create a text file in the current directory and save each tweet's retweet count and favorite ("like") count
# in the text file in JSON format

with open('tweet_json.txt', 'w') as outfile:
    json.dump(data, outfile)
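
The cell above stores everything as a single JSON document. An alternative layout, shown here only as a sketch, writes one JSON object per line (often called JSON Lines), which can then be read back line by line. The file name `tweet_json_lines.txt` and the helper names are assumptions made for this example.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w") as outfile:
        for record in records:
            outfile.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Read the file back line by line into a list of dicts."""
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]


sample = [
    {"tweet_id": 892420643555336193, "retweet_count": 7189, "likes_count": 34531},
    {"tweet_id": 892177421306343426, "retweet_count": 5396, "likes_count": 29940},
]
path = os.path.join(tempfile.mkdtemp(), "tweet_json_lines.txt")
write_json_lines(sample, path)
restored = read_json_lines(path)
```

Either layout works with pandas afterwards; the single-document form used in this notebook is simply loaded in one call to `json.load`.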

> ### Below, I will load "tweet_json.txt" into the notebook and print it out to see how it looks

In [8]:
# Load 'tweet_json.txt' and print each tweet's id, retweet count, and likes count
with open('tweet_json.txt') as json_file:
    data = json.load(json_file)
    for p in data['tweet_info']:
        print('Tweet id: ' , p['tweet_id'])
        print('Retweet count: ' , p['retweet_count'])
        print('Likes count: ' , p['likes_count'])
        print('')

Tweet id:  892420643555336193
Retweet count:  7189
Likes count:  34531

Tweet id:  892177421306343426
Retweet count:  5396
Likes count:  29940

Tweet id:  891815181378084864
Retweet count:  3562
Likes count:  22530

Tweet id:  891689557279858688
Retweet count:  7413
Likes count:  37769

Tweet id:  891327558926688256
Retweet count:  7947
Likes count:  36071

Tweet id:  891087950875897856
Retweet count:  2672
Likes count:  18197

Tweet id:  890971913173991426
Retweet count:  1715
Likes count:  10603

Tweet id:  890729181411237888
Retweet count:  16176
Likes count:  58118

Tweet id:  890609185150312448
Retweet count:  3712
Likes count:  25055

Tweet id:  890240255349198849
Retweet count:  6247
Likes count:  28574

Tweet id:  890006608113172480
Retweet count:  6296
Likes count:  27595

Tweet id:  889880896479866881
Retweet count:  4274
Likes count:  25085

Tweet id:  889665388333682689
Retweet count:  8560
Likes count:  42978

Tweet id:  889638837579907072
Retweet count:  3818
Likes count:

Likes count:  0

Tweet id:  834458053273591808
Retweet count:  1583
Likes count:  9232

Tweet id:  834209720923721728
Retweet count:  4461
Likes count:  19682

Tweet id:  834167344700198914
Retweet count:  3464
Likes count:  15191

Tweet id:  834089966724603904
Retweet count:  2016
Likes count:  9641

Tweet id:  834086379323871233
Retweet count:  2081
Likes count:  12521

Tweet id:  833863086058651648
Retweet count:  2266
Likes count:  12920

Tweet id:  833826103416520705
Retweet count:  3563
Likes count:  17440

Tweet id:  833732339549220864
Retweet count:  209
Likes count:  0

Tweet id:  833722901757046785
Retweet count:  3068
Likes count:  20099

Tweet id:  833479644947025920
Retweet count:  1932
Likes count:  14316

Tweet id:  833124694597443584
Retweet count:  4574
Likes count:  19438

Tweet id:  832998151111966721
Retweet count:  2072
Likes count:  12722

Tweet id:  832769181346996225
Retweet count:  38
Likes count:  0

Tweet id:  832757312314028032
Retweet count:  3418
Likes cou


Tweet id:  806629075125202948
Retweet count:  33218
Likes count:  72307

Tweet id:  806620845233815552
Retweet count:  5297
Likes count:  0

Tweet id:  806576416489959424
Retweet count:  1881
Likes count:  4755

Tweet id:  806542213899489280
Retweet count:  2275
Likes count:  9931

Tweet id:  806242860592926720
Retweet count:  11162
Likes count:  0

Tweet id:  806219024703037440
Retweet count:  1157
Likes count:  6279

Tweet id:  805958939288408065
Retweet count:  5146
Likes count:  0

Tweet id:  805932879469572096
Retweet count:  1852
Likes count:  8080

Tweet id:  805826884734976000
Retweet count:  1753
Likes count:  6389

Tweet id:  805823200554876929
Retweet count:  7700
Likes count:  0

Tweet id:  805520635690676224
Retweet count:  1596
Likes count:  5599

Tweet id:  805487436403003392
Retweet count:  2434
Likes count:  8555

Tweet id:  805207613751304193
Retweet count:  1666
Likes count:  7637

Tweet id:  804738756058218496
Retweet count:  3687
Likes count:  13294

Tweet id:  80

Likes count:  5183

Tweet id:  768970937022709760
Retweet count:  6162
Likes count:  13713

Tweet id:  768909767477751808
Retweet count:  2530
Likes count:  0

Tweet id:  768855141948723200
Retweet count:  849
Likes count:  4027

Tweet id:  768609597686943744
Retweet count:  1131
Likes count:  3942

Tweet id:  768596291618299904
Retweet count:  1217
Likes count:  4844

Tweet id:  768554158521745409
Retweet count:  5523
Likes count:  0

Tweet id:  768473857036525572
Retweet count:  3244
Likes count:  13082

Tweet id:  768193404517830656
Retweet count:  3358
Likes count:  10456

Tweet id:  767884188863397888
Retweet count:  1295
Likes count:  4561

Tweet id:  767754930266464257
Retweet count:  5097
Likes count:  15356

Tweet id:  767500508068192258
Retweet count:  2214
Likes count:  7209

Tweet id:  767191397493538821
Retweet count:  3582
Likes count:  11821

Tweet id:  767122157629476866
Retweet count:  2720
Likes count:  9775

Tweet id:  766793450729734144
Retweet count:  1287
Likes co

Likes count:  17368

Tweet id:  743545585370791937
Retweet count:  887
Likes count:  3356

Tweet id:  743510151680958465
Retweet count:  3502
Likes count:  7563

Tweet id:  743253157753532416
Retweet count:  1147
Likes count:  4024

Tweet id:  743222593470234624
Retweet count:  1795
Likes count:  5882

Tweet id:  743210557239623680
Retweet count:  1276
Likes count:  3643

Tweet id:  742534281772302336
Retweet count:  3363
Likes count:  6702

Tweet id:  742528092657332225
Retweet count:  1841
Likes count:  4197

Tweet id:  742465774154047488
Retweet count:  3693
Likes count:  6893

Tweet id:  742423170473463808
Retweet count:  3512
Likes count:  9324

Tweet id:  742385895052087300
Retweet count:  1836
Likes count:  6449

Tweet id:  742161199639494656
Retweet count:  1274
Likes count:  4083

Tweet id:  742150209887731712
Retweet count:  1447
Likes count:  4834

Tweet id:  741793263812808706
Retweet count:  1411
Likes count:  4307

Tweet id:  741743634094141440
Retweet count:  2580
Likes 

Retweet count:  1069
Likes count:  3002

Tweet id:  705475953783398401
Retweet count:  854
Likes count:  2818

Tweet id:  705442520700944385
Retweet count:  1501
Likes count:  4165

Tweet id:  705428427625635840
Retweet count:  1562
Likes count:  3606

Tweet id:  705239209544720384
Retweet count:  703
Likes count:  2841

Tweet id:  705223444686888960
Retweet count:  745
Likes count:  2444

Tweet id:  705102439679201280
Retweet count:  486
Likes count:  2030

Tweet id:  705066031337840642
Retweet count:  570
Likes count:  2069

Tweet id:  704871453724954624
Retweet count:  1050
Likes count:  4050

Tweet id:  704859558691414016
Retweet count:  497
Likes count:  2137

Tweet id:  704847917308362754
Retweet count:  1416
Likes count:  4812

Tweet id:  704819833553219584
Retweet count:  917
Likes count:  2501

Tweet id:  704761120771465216
Retweet count:  2721
Likes count:  6294

Tweet id:  704499785726889984
Retweet count:  934
Likes count:  2745

Tweet id:  704491224099647488
Retweet count:

Likes count:  2133

Tweet id:  689283819090870273
Retweet count:  1032
Likes count:  3090

Tweet id:  689280876073582592
Retweet count:  661
Likes count:  1870

Tweet id:  689275259254616065
Retweet count:  233
Likes count:  1087

Tweet id:  689255633275777024
Retweet count:  1011
Likes count:  2417

Tweet id:  689154315265683456
Retweet count:  925
Likes count:  2925

Tweet id:  689143371370250240
Retweet count:  468
Likes count:  1914

Tweet id:  688916208532455424
Retweet count:  804
Likes count:  2563

Tweet id:  688908934925697024
Retweet count:  719
Likes count:  1990

Tweet id:  688898160958271489
Retweet count:  731
Likes count:  1983

Tweet id:  688894073864884227
Retweet count:  641
Likes count:  2126

Tweet id:  688828561667567616
Retweet count:  342
Likes count:  1306

Tweet id:  688804835492233216
Retweet count:  176
Likes count:  913

Tweet id:  688789766343622656
Retweet count:  623
Likes count:  2102

Tweet id:  688547210804498433
Retweet count:  657
Likes count:  2455


Retweet count:  950
Likes count:  2103

Tweet id:  676146341966438401
Retweet count:  609
Likes count:  1791

Tweet id:  676121918416756736
Retweet count:  1068
Likes count:  2010

Tweet id:  676101918813499392
Retweet count:  1078
Likes count:  2641

Tweet id:  676098748976615425
Retweet count:  1293
Likes count:  2760

Tweet id:  676089483918516224
Retweet count:  404
Likes count:  1218

Tweet id:  675898130735476737
Retweet count:  521
Likes count:  1494

Tweet id:  675891555769696257
Retweet count:  782
Likes count:  1976

Tweet id:  675888385639251968
Retweet count:  867
Likes count:  2195

Tweet id:  675878199931371520
Retweet count:  1282
Likes count:  3873

Tweet id:  675870721063669760
Retweet count:  518
Likes count:  1513

Tweet id:  675853064436391936
Retweet count:  1176
Likes count:  2498

Tweet id:  675849018447167488
Retweet count:  129
Likes count:  883

Tweet id:  675845657354215424
Retweet count:  801
Likes count:  2104

Tweet id:  675822767435051008
Retweet count:  


Tweet id:  671159727754231808
Retweet count:  72
Likes count:  323

Tweet id:  671154572044468225
Retweet count:  197
Likes count:  640

Tweet id:  671151324042559489
Retweet count:  137
Likes count:  598

Tweet id:  671147085991960577
Retweet count:  199
Likes count:  597

Tweet id:  671141549288370177
Retweet count:  587
Likes count:  1048

Tweet id:  671138694582165504
Retweet count:  358
Likes count:  839

Tweet id:  671134062904504320
Retweet count:  170
Likes count:  670

Tweet id:  671122204919246848
Retweet count:  2262
Likes count:  3163

Tweet id:  671115716440031232
Retweet count:  697
Likes count:  1245

Tweet id:  671109016219725825
Retweet count:  392
Likes count:  1039

Tweet id:  670995969505435648
Retweet count:  256
Likes count:  996

Tweet id:  670842764863651840
Retweet count:  7791
Likes count:  22611

Tweet id:  670840546554966016
Retweet count:  171
Likes count:  532

Tweet id:  670838202509447168
Retweet count:  614
Likes count:  1011

Tweet id:  67083381285993

Tweet id:  666345417576210432
Retweet count:  127
Likes count:  256

Tweet id:  666337882303524864
Retweet count:  80
Likes count:  169

Tweet id:  666293911632134144
Retweet count:  301
Likes count:  440

Tweet id:  666287406224695296
Retweet count:  57
Likes count:  126

Tweet id:  666273097616637952
Retweet count:  69
Likes count:  153

Tweet id:  666268910803644416
Retweet count:  32
Likes count:  90

Tweet id:  666104133288665088
Retweet count:  5626
Likes count:  12922

Tweet id:  666102155909144576
Retweet count:  11
Likes count:  67

Tweet id:  666099513787052032
Retweet count:  55
Likes count:  137

Tweet id:  666094000022159362
Retweet count:  65
Likes count:  146

Tweet id:  666082916733198337
Retweet count:  39
Likes count:  97

Tweet id:  666073100786774016
Retweet count:  136
Likes count:  277

Tweet id:  666071193221509120
Retweet count:  51
Likes count:  130

Tweet id:  666063827256086533
Retweet count:  185
Likes count:  417

Tweet id:  666058600524156928
Retweet count

> #### Convert 'tweet_json.txt' to a pandas dataframe 

In [9]:
# Flatten the list of tweet records into a pandas dataframe with tweet_id, retweet_count, and likes_count
tweet_json_df = pd.json_normalize(data['tweet_info'])


In [10]:
tweet_json_df.head()

Unnamed: 0,tweet_id,retweet_count,likes_count
0,892420643555336193,7189,34531
1,892177421306343426,5396,29940
2,891815181378084864,3562,22530
3,891689557279858688,7413,37769
4,891327558926688256,7947,36071



## Assessing Data


### Assessing WeRateDogs twitter account archive table



In [11]:
df.sample(20)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1883,674805413498527744,,,2015-12-10 04:18:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When your entire life is crumbling before you ...,,,,https://twitter.com/dog_rates/status/674805413...,10,10,,,,,
1631,684481074559381504,,,2016-01-05 21:06:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Pippa. She's an Elfin High Feta. Compact ...,,,,https://twitter.com/dog_rates/status/684481074...,10,10,Pippa,,,,
892,759447681597108224,,,2016-07-30 17:56:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oakley. He has no idea what happened h...,,,,https://twitter.com/dog_rates/status/759447681...,11,10,Oakley,,,,
461,817536400337801217,,,2017-01-07 01:00:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Eugene &amp; Patti Melt. No matte...,,,,https://twitter.com/dog_rates/status/817536400...,12,10,Eugene,,,,
1025,746369468511756288,,,2016-06-24 15:48:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an Iraqi Speed Kangaroo. It is not a d...,,,,https://twitter.com/dog_rates/status/746369468...,9,10,an,,,,
1191,717841801130979328,,,2016-04-06 22:29:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Barclay. His father was a banana. 11/1...,,,,https://twitter.com/dog_rates/status/717841801...,11,10,Barclay,,,,
779,775842724423557120,,,2016-09-13 23:44:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Blue. He was having an average day unt...,,,,https://twitter.com/dog_rates/status/775842724...,12,10,Blue,,,,
2127,670319130621435904,,,2015-11-27 19:11:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",AT DAWN...\nWE RIDE\n\n11/10 https://t.co/QnfO...,,,,https://twitter.com/dog_rates/status/670319130...,11,10,,,,,
1230,713411074226274305,,,2016-03-25 17:03:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we see an extremely rare Bearded Floofmal...,,,,https://twitter.com/dog_rates/status/713411074...,11,10,,,,,
600,798673117451325440,,,2016-11-15 23:44:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: I shall call him squishy and he...,6.755011e+17,4196984000.0,2015-12-12 02:23:01 +0000,https://twitter.com/dog_rates/status/675501075...,13,10,,,,,


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [13]:
df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [14]:
df[df['text'].str.contains('RT')].shape[0]

192
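
Searching for the substring 'RT' anywhere in the text can also match tweets that merely contain those two letters. A stricter check, sketched below on a small made-up frame, compares that count with rows whose text starts with `RT @` and with rows that have a non-null `retweeted_status_id`. The sample rows are invented for illustration and are not taken from the archive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "text": [
        "RT @dog_rates: This is Rex. 12/10",
        "This is Bo. SMART pup. 11/10",
        "Meet Pippa. 10/10",
    ],
    "retweeted_status_id": [6.755011e17, np.nan, np.nan],
})

# Substring match: also catches 'SMART'
contains_rt = toy["text"].str.contains("RT").sum()
# Prefix match: only texts that begin like a retweet
starts_with_rt = toy["text"].str.startswith("RT @").sum()
# Metadata check: rows that carry a retweeted status id
has_retweet_id = toy["retweeted_status_id"].notna().sum()
print(contains_rt, starts_with_rt, has_retweet_id)
```

In the archive itself, `df.info()` reports 181 non-null `retweeted_status_id` values while the substring search returns 192, so the two checks do not select exactly the same rows.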

In [15]:
df[df.tweet_id ==710609963652087808][['tweet_id','expanded_urls','source']]

Unnamed: 0,tweet_id,expanded_urls,source
1255,710609963652087808,https://vine.co/v/idaTpwH5TgU,"<a href=""http://vine.co"" rel=""nofollow"">Vine -..."


In [16]:
df.name.unique()

array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'such',
       'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey',
       'Lilly', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella',
       'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey',
       'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey',
       'Duddles', 'Jack', 'Emmy', 'Steven', 'Beau', 'Snoopy', 'Shadow',
       'Terrance', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict',
       'Venti', 'Goose', 'Nugget', 'Cash', 'Coco', 'Jed', 'Sebastian',
       'Walter', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover',
       'Napolean', 'Dawn', 'Boomer', 'Cody', 'Rumble', 'Clifford',
       'quite', 'Dewey', 'Scout', 'Gizmo', 'Cooper', 'Harold', 'Shikha',
       'Jamesy', 'Lili', 'Sammy', 'Meatball', 'Paisley', 'Albus',
       'Nept

In [17]:
print(df.duplicated(subset=['expanded_urls']).sum())


137
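
`duplicated()` treats missing values as equal to each other, so rows with a missing `expanded_urls` can be counted as duplicates too. The following sketch, using an invented five-row frame with placeholder URLs, separates the two cases.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "expanded_urls": [
        "https://example.com/a",
        "https://example.com/a",
        np.nan,
        np.nan,
        "https://example.com/b",
    ]
})

# Counts every repeat after the first occurrence, including repeated NaN
all_dupes = toy.duplicated(subset=["expanded_urls"]).sum()
# Restrict the check to rows that actually have a URL
url_dupes = toy["expanded_urls"].dropna().duplicated().sum()
print(all_dupes, url_dupes)
```

Since `df.info()` shows 2297 non-null `expanded_urls` out of 2356 rows, part of the 137 may come from missing values rather than repeated links.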


In [18]:
# Print the full text of the first tweet
print(df.text.loc[0])

This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
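
The sample text above carries its rating inline as `13/10`. A regular expression is one way to pull the numerator and denominator out of the text so they can be compared with the `rating_numerator` and `rating_denominator` columns. This is a sketch: the pattern and the `extract_rating` helper are assumptions, and tweets containing several fractions or dates would need extra handling.

```python
import re

# A number (optionally with a decimal part), a slash, then a whole number
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) of the first 'x/y' match, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


text = ("This is Phineas. He's a mystical boy. Only ever appears in the "
        "hole of a donut. 13/10 https://t.co/MgUWQ76dJU")
print(extract_rating(text))
```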



#### Quality

- Missing values in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `doggo`, `floofer`, `pupper`, `puppo`, `expanded_urls`
- `text` column contains retweets
- `name` column contains inappropriate names ["a", "the", "None", 'O', 'all', 'old']
- Incorrect `expanded_urls` for `tweet_id` 812503143955202048 and 710609963652087808
- Incorrect `source` and `expanded_urls` for `tweet_id` 710609963652087808
- Improper datatypes (`in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`)
- Null values are represented as 'None' instead of 'NaN'
- Some tweets have invalid `rating_numerator` and `rating_denominator` values (`df.describe()` shows numerators from 0 to 1776 and denominators from 0 to 170)
- There are 137 duplicated values in `expanded_urls`
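
Two of the quality issues above, the literal string 'None' and the lowercase words that were picked up as names, can be handled together. The sketch below works on a made-up `name` column; treating every all-lowercase value as a non-name is an assumption of this example and would miss entries such as 'O'.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Phineas", "None", "a", "such", "Tilly", "quite"], name="name")

# Lowercase entries such as 'a' or 'such' are words, not names
is_lowercase_word = names.str.islower()
# Turn the placeholder string into a real missing value, then blank out the words
cleaned = names.replace("None", np.nan).mask(is_lowercase_word)
print(cleaned.tolist())
```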

#### Tidiness

- Since we are only interested in original tweets, we do not need the following variables (`retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`)
- Some records in `source` contain HTML tags
- Some tweets' texts in the `text` column begin with "RT" followed by a Twitter account.
- `doggo`, `floofer`, `pupper`, `puppo` should be in one column
- Tweet texts contain URLs at the end
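
For the tidiness issues, the four stage columns can be collapsed into a single column and the HTML in `source` reduced to its visible label. The sketch below uses an invented three-row frame; the column name `dog_stage`, the comma-joined format for tweets with several stages, and the sample `source` value are choices made for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "source": ['<a href="http://example.com/app" rel="nofollow">Example App</a>'] * 3,
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage that is set in the row; NaN when none is set."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else np.nan


toy["dog_stage"] = toy[stage_cols].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)

# Keep only the text between the opening and closing <a> tags
toy["source"] = toy["source"].str.extract(r">([^<]+)<", expand=False)
print(toy)
```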

### Assessing `img_pred` table

In [19]:
img_pred.sample(20)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1074,717009362452090881,https://pbs.twimg.com/media/CfNUNetW8AAekHx.jpg,1,Siberian_husky,0.506154,True,Eskimo_dog,0.269656,True,malamute,0.060658,True
1734,821522889702862852,https://pbs.twimg.com/media/C2aitIUXAAAG-Wi.jpg,1,Doberman,0.763539,True,black-and-tan_coonhound,0.136602,True,miniature_pinscher,0.087654,True
1754,824775126675836928,https://pbs.twimg.com/media/C3Iwlr0WYAARVh4.jpg,1,Border_terrier,0.610499,True,malinois,0.090291,True,Airedale,0.068625,True
1130,728035342121635841,https://pbs.twimg.com/media/ChqARqmWsAEI6fB.jpg,1,handkerchief,0.302961,False,Pomeranian,0.248664,True,Shih-Tzu,0.111015,True
261,670786190031921152,https://pbs.twimg.com/media/CU8ceuxWUAALMEo.jpg,1,dingo,0.777124,False,Pembroke,0.127438,True,Cardigan,0.024007,True
1257,748575535303884801,https://pbs.twimg.com/media/CmN5ecNWMAE6pnf.jpg,1,muzzle,0.176172,False,seat_belt,0.160953,False,soft-coated_wheaten_terrier,0.086499,True
1223,744709971296780288,https://pbs.twimg.com/media/ClW9w7mWEAEFN1k.jpg,1,Shetland_sheepdog,0.234431,True,Samoyed,0.114876,True,collie,0.086614,True
1955,864279568663928832,https://pbs.twimg.com/media/C_6JrWZVwAAHhCD.jpg,1,bull_mastiff,0.668613,True,French_bulldog,0.180562,True,Staffordshire_bullterrier,0.052237,True
2023,881536004380872706,https://pbs.twimg.com/ext_tw_video_thumb/88153...,1,Samoyed,0.281463,True,Angora,0.272066,False,Persian_cat,0.114854,False
636,681242418453299201,https://pbs.twimg.com/media/CXRCXesVAAArSXt.jpg,1,motor_scooter,0.255934,False,rifle,0.145202,False,assault_rifle,0.097,False


In [20]:
img_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [21]:
pd.set_option("display.max_rows", None, "display.max_columns", None)

img_pred['p1'].value_counts()

golden_retriever                  150
Labrador_retriever                100
Pembroke                           89
Chihuahua                          83
pug                                57
chow                               44
Samoyed                            43
toy_poodle                         39
Pomeranian                         38
cocker_spaniel                     30
malamute                           30
French_bulldog                     26
miniature_pinscher                 23
Chesapeake_Bay_retriever           23
seat_belt                          22
Siberian_husky                     20
Staffordshire_bullterrier          20
German_shepherd                    20
web_site                           19
Cardigan                           19
Shetland_sheepdog                  18
beagle                             18
Maltese_dog                        18
Eskimo_dog                         18
teddy                              18
Shih-Tzu                           17
Rottweiler  

In [22]:
img_pred.query('tweet_id ==671547767500775424').jpg_url

312    https://pbs.twimg.com/media/CVHRIiqWEAAj98K.jpg
Name: jpg_url, dtype: object

In [23]:
img_pred.isnull().sum()

tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [24]:
img_pred.jpg_url.duplicated( keep='first').sum()


66

#### Quality
- Underscores separate words in the `p1`, `p2`, and `p3` columns.
- Capitalization of dog breed names in the `p1`, `p2`, and `p3` columns is inconsistent.
- The breed predictions for `tweet_id`s 717790033953034240, 675135153782571009, and 671547767500775424 are incorrect in all three predictions (`p1`, `p2`, `p3`).
- The image for `tweet_id` 669015743032369152 does not show a real dog.
- 66 duplicates in the `jpg_url` column.
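
The first two issues could be addressed with pandas string methods. The snippet below is a minimal sketch on a toy frame: `sample_pred` is a hypothetical stand-in for `img_pred`, and lowercasing is only one possible convention for consistent breed names.

```python
import pandas as pd

# Toy rows standing in for img_pred (hypothetical sample, same column names)
sample_pred = pd.DataFrame({
    "p1": ["golden_retriever", "Chihuahua", "web_site"],
    "p2": ["Labrador_retriever", "Pekinese", "seat_belt"],
    "p3": ["German_short-haired_pointer", "papillon", "teddy"],
})

for col in ["p1", "p2", "p3"]:
    # Swap underscores for spaces, then lowercase for one consistent style
    sample_pred[col] = (
        sample_pred[col].str.replace("_", " ", regex=False).str.lower()
    )

print(sample_pred)
```

Hyphens inside breed names (such as `short-haired`) are left untouched, since only underscores are replaced.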


### Assessing `tweet_json_df`  table



In [25]:
tweet_json_df.head()

Unnamed: 0,tweet_id,retweet_count,likes_count
0,892420643555336193,7189,34531
1,892177421306343426,5396,29940
2,891815181378084864,3562,22530
3,891689557279858688,7413,37769
4,891327558926688256,7947,36071


# Cleaning
### Define
- Merge `df`, `img_pred`, and `tweet_json_df` into one master dataframe named `master_df`
- Create a copy of the master dataframe

### Code

In [26]:
# Merge the three datasets into one master dataset

master_df = pd.merge(df, img_pred, how = 'left', on = ['tweet_id'] )
master_df = pd.merge(master_df,tweet_json_df, how = 'left', on =['tweet_id'] )
master_df.to_csv('dataset'+'/'+'master_df.csv', encoding='utf-8', index =False)
master_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,likes_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1.0,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False,7189.0,34531.0
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1.0,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True,5396.0,29940.0
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1.0,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,3562.0,22530.0
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1.0,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7413.0,37769.0
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2.0,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,7947.0,36071.0
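
A left join keeps every archive row, so tweets without a match get `NaN` in the joined columns, and integer columns holding `NaN` are shown as floats (as `img_num` and `retweet_count` are above). The sketch below, on hypothetical toy frames named `archive` and `predictions`, shows how the `validate` and `indicator` options of `pd.merge` could be used to check key uniqueness and count unmatched rows.

```python
import pandas as pd

# Hypothetical toy frames; tweet_id 3 has no prediction row
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "img_num": [1, 2]})

merged = pd.merge(
    archive,
    predictions,
    how="left",
    on="tweet_id",
    validate="one_to_one",  # raises MergeError if tweet_id repeats on either side
    indicator=True,         # adds a _merge column describing each row's origin
)

# Rows that exist only in the left (archive) frame
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)
print(merged["img_num"].dtype)
```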


In [27]:
# Test the dataset file
print(pd.read_csv('dataset/master_df.csv').shape)
pd.read_csv('dataset/master_df.csv').head(1)


(2356, 30)


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,likes_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1.0,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False,7189.0,34531.0


In [28]:
# Copy master_df so the original dataset stays unchanged

master_copy = master_df.copy()
master_copy.shape

(2356, 30)

## Tidiness 


#### Define
- Remove the retweet and reply columns (`retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `in_reply_to_status_id`, `in_reply_to_user_id`)

#### Code

In [29]:
master_copy.drop(columns=['retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp','in_reply_to_status_id','in_reply_to_user_id'], inplace = True)

#### Test

In [30]:
master_copy.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog',
       'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'retweet_count',
       'likes_count'],
      dtype='object')

#### Define 
- Strip the HTML tags from the `source` column, keeping only the source name (for example, `Twitter for iPhone`)


#### Code

In [31]:
master_copy.source[4]

'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

In [32]:
sources = []
for source in master_copy.source:
    sources.append(re.findall(r'>(.*)<',source)[0])
master_copy.source = sources
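
The same extraction can also be written without an explicit loop. This is a sketch using `Series.str.extract` on a hypothetical two-row sample (`sample_source`); unlike indexing `re.findall(...)[0]`, it yields `NaN` for a row that does not match rather than raising an `IndexError`.

```python
import pandas as pd

# Hypothetical sample of raw source values
sample_source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between the opening tag's '>' and the closing tag's '<'
extracted = sample_source.str.extract(r">(.*)<", expand=False)
print(extracted.tolist())
```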

In [33]:
master_copy.sample(5)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,likes_count
2284,667192066997374976,2015-11-19 04:05:59 +0000,Twitter for iPhone,*takes several long deep breaths* omg omg oMG ...,https://twitter.com/dog_rates/status/667192066...,12,10,,,,,,https://pbs.twimg.com/media/CUJXpRBXIAAN0yz.jpg,1.0,Rottweiler,0.28364,True,miniature_pinscher,0.148112,True,black-and-tan_coonhound,0.095585,True,87.0,341.0
25,887101392804085760,2017-07-18 00:07:08 +0000,Twitter for iPhone,This... is a Jubilant Antarctic House Bear. We...,https://twitter.com/dog_rates/status/887101392...,12,10,,,,,,https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg,1.0,Samoyed,0.733942,True,Eskimo_dog,0.035029,True,Staffordshire_bullterrier,0.029705,True,5106.0,27540.0
785,775085132600442880,2016-09-11 21:34:30 +0000,Twitter for iPhone,This is Tucker. He would like a hug. 13/10 som...,https://twitter.com/dog_rates/status/775085132...,13,10,Tucker,,,,,https://pbs.twimg.com/media/CsGnz64WYAEIDHJ.jpg,1.0,chow,0.316565,True,golden_retriever,0.241929,True,Pomeranian,0.157524,True,4488.0,14977.0
1034,745057283344719872,2016-06-21 00:54:33 +0000,Twitter for iPhone,This is Oliver. He's downright gorgeous as hel...,https://twitter.com/dog_rates/status/745057283...,12,10,Oliver,,,,,https://pbs.twimg.com/media/Clb5pLJWMAE-QS1.jpg,2.0,Shetland_sheepdog,0.963985,True,collie,0.026206,True,Border_collie,0.004544,True,2101.0,6899.0
1267,709566166965075968,2016-03-15 02:25:31 +0000,Twitter for iPhone,This is Olaf. He's gotta be rare. Seems sturdy...,https://twitter.com/dog_rates/status/709566166...,12,10,Olaf,,,,,https://pbs.twimg.com/media/Cdjiqi6XIAIUOg-.jpg,1.0,chow,0.999837,True,Tibetan_mastiff,0.000117,True,Australian_terrier,1.1e-05,True,1118.0,3344.0


#### Test

In [34]:
len(master_copy.source) == len(master_copy)

True

In [35]:
master_copy.source.head()

0    Twitter for iPhone
1    Twitter for iPhone
2    Twitter for iPhone
3    Twitter for iPhone
4    Twitter for iPhone
Name: source, dtype: object

In [36]:
master_copy.source.sample(5)

336     Twitter for iPhone
879     Twitter for iPhone
23      Twitter for iPhone
2221    Twitter for iPhone
636     Twitter for iPhone
Name: source, dtype: object

#### Define 
- Remove retweets from the dataset



#### Code

In [37]:
# Count the number of retweets in the dataset

master_copy.text.str.startswith('RT @').sum()

181

In [38]:
# Remove retweets from the dataset

retweet_index = master_copy[master_copy.text.str.startswith('RT @')].index
master_copy.drop(retweet_index, inplace = True)


#### Test

In [39]:
# Check if there is any retweet in the dataset
master_copy.text.str.startswith('RT @').sum()


0

In [40]:
(len(master_copy)+ 181) == 2356

True
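
An equivalent way to express this step is to filter with a boolean mask instead of dropping by index. The sketch below uses a hypothetical three-row frame (`sample_tweets`); the `RT @` prefix test is the same one used above.

```python
import pandas as pd

# Hypothetical sample: two retweets and one original tweet
sample_tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["RT @dog_rates: This is ...", "This is Tilly. 13/10", "RT @someone: ..."],
})

# True for rows whose text starts with the retweet prefix
is_retweet = sample_tweets["text"].str.startswith("RT @")

# Keep only the rows that are not retweets
originals = sample_tweets[~is_retweet].reset_index(drop=True)
print(len(originals))
```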


#### Define 

- Reshape the dataframe so that the four dog-stage columns (`doggo`, `floofer`, `pupper`, `puppo`) are combined into a single column


#### Code

In [41]:
# List the columns to keep as identifier variables
columns = ['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'retweet_count',
 'likes_count']



In [42]:
# Melt the dog-stage columns into a single column

master_copy = pd.melt(master_copy, id_vars= columns,value_vars= ['doggo', 'floofer', 'pupper', 'puppo'],
            var_name="type", value_name="dog_stage" )

#### Test

In [43]:
master_copy.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,likes_count,type,dog_stage
0,892420643555336193,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1.0,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False,7189.0,34531.0,doggo,
1,892177421306343426,2017-08-01 00:17:27 +0000,Twitter for iPhone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1.0,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True,5396.0,29940.0,doggo,
2,891815181378084864,2017-07-31 00:18:03 +0000,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1.0,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,3562.0,22530.0,doggo,
3,891689557279858688,2017-07-30 15:58:51 +0000,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1.0,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7413.0,37769.0,doggo,
4,891327558926688256,2017-07-29 16:00:24 +0000,Twitter for iPhone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2.0,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,7947.0,36071.0,doggo,


In [44]:
master_copy.dog_stage.value_counts()

None       8344
pupper      234
doggo        87
puppo        25
floofer      10
Name: dog_stage, dtype: int64

In [45]:
master_copy.drop(columns = ['type'], inplace = True)

In [46]:
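# The string 'None' sorts ahead of the lowercase stage names, so keep='last'
# retains a real stage whenever a tweet has one (only one stage is kept per tweet)
master_copy = master_copy.sort_values('dog_stage').drop_duplicates('tweet_id', keep = 'last')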
master_copy.dog_stage.value_counts()

None       1831
pupper      234
doggo        75
puppo        25
floofer      10
Name: dog_stage, dtype: int64
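
The `doggo` count falls from 87 to 75 after de-duplication, which suggests that 12 tweets carried a second stage and kept only the alphabetically later one. If preserving both labels matters, the stage columns could be combined row by row instead of melted. The sketch below is one such alternative on hypothetical toy data; `combine_stages` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical sample: no stage, one stage, and two stages
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo":   ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [stage for stage in stages if row[stage] != "None"]
    return ", ".join(found) if found else None

sample["dog_stage"] = sample[stages].apply(combine_stages, axis=1)
sample = sample.drop(columns=stages)
print(sample)
```

This keeps one row per tweet from the start, so no sorting or de-duplication step is needed.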

In [47]:
master_copy.tail()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,likes_count,dog_stage
7430,738537504001953792,2016-06-03 01:07:16 +0000,Twitter for iPhone,This is Bayley. She fell asleep trying to esca...,https://twitter.com/dog_rates/status/738537504...,11,10,Bayley,https://pbs.twimg.com/media/Cj_P7rSUgAAYQbz.jpg,1.0,chow,0.808737,True,gibbon,0.028942,False,Pembroke,0.026498,True,1436.0,4811.0,puppo
7059,790946055508652032,2016-10-25 16:00:09 +0000,Twitter for iPhone,This is Betty. She's assisting with the dishes...,https://twitter.com/dog_rates/status/790946055...,12,10,Betty,https://pbs.twimg.com/media/CvoBPWRWgAA4het.jpg,1.0,dishwasher,0.700466,False,golden_retriever,0.245773,True,chow,0.039012,True,4504.0,16234.0,puppo
7395,743253157753532416,2016-06-16 01:25:36 +0000,Twitter for iPhone,This is Kilo. He cannot reach the snackum. Nif...,https://twitter.com/dog_rates/status/743253157...,10,10,Kilo,https://pbs.twimg.com/media/ClCQzFUUYAA5vAu.jpg,1.0,malamute,0.442612,True,Siberian_husky,0.368137,True,Eskimo_dog,0.177822,True,1147.0,4024.0,puppo
7276,756275833623502848,2016-07-21 23:53:04 +0000,Twitter for iPhone,When ur older siblings get to play in the deep...,https://twitter.com/dog_rates/status/756275833...,10,10,,https://pbs.twimg.com/media/Cn7U2xlW8AI9Pqp.jpg,1.0,Airedale,0.602957,True,Irish_terrier,0.086981,True,bloodhound,0.086276,True,1444.0,6098.0,puppo
7298,752519690950500352,2016-07-11 15:07:30 +0000,Twitter for iPhone,Hopefully this puppo on a swing will help get ...,https://twitter.com/dog_rates/status/752519690...,11,10,,https://pbs.twimg.com/media/CnF8qVDWYAAh0g1.jpg,3.0,swing,0.999984,False,Labrador_retriever,1e-05,True,Eskimo_dog,1e-06,True,3242.0,7026.0,puppo


In [48]:
master_copy.shape

(2175, 22)

#### Define

- Remove URLs from tweet texts

#### Code

In [49]:
string = master_copy.text.iloc[4]

In [50]:

# Keep only the text that precedes the first URL in each tweet
url_pattern = re.compile(r"https?://\S+")

def strip_url(text):
    match = url_pattern.search(text)
    return text[:match.start()].strip() if match else text.strip()

# Assign the whole column at once to avoid chained-assignment warnings
master_copy['text'] = master_copy['text'].apply(strip_url)


#### Test

In [51]:
master_copy.text.sample(5)

5993    This pupper is very passionate about Christmas...
49      This is Koko. Her owner, inspired by Barney, r...
5421    This is Chuckles. He had a balloon but he acci...
3403    Meet Reagan. He's a Persnicketus Derpson. Grea...
3517    This is Cedrick. He's a spookster. Did me a di...
Name: text, dtype: object
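
Sampling five rows cannot show that every URL is gone. A stricter check might assert that no cleaned text still contains a link. The sketch below runs on a hypothetical two-row series (`sample_text`, with a made-up placeholder URL) and uses a vectorized `str.replace` that drops everything from the first URL onward.

```python
import pandas as pd

# Hypothetical sample; the URL is a placeholder, not a real tweet link
sample_text = pd.Series([
    "This is Phineas. He's a mystical boy. 13/10 https://t.co/example",
    "No link in this one. 12/10",
])

# (?s) lets '.' match newlines, so everything after the first URL is removed
cleaned = sample_text.str.replace(r"(?s)https?://.*", "", regex=True).str.strip()

# True only if no cleaned text still contains a URL scheme
no_urls_left = not cleaned.str.contains(r"https?://", regex=True).any()
print(no_urls_left)
```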

<br>

## Quality

- Missing values in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `doggo`, `floofer`, `pupper`, `puppo`, `expanded_urls`
- `text` column contains retweets
- `name` column contains invalid names ("a", "the", "None", "O", "all", "old")
- Incorrect `expanded_urls` for `tweet_id` 812503143955202048 and 710609963652087808
- Incorrect `source` and `expanded_urls` for `tweet_id` 710609963652087808
- Improper datatypes (`in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`)
- Null values are represented as the string 'None' instead of `NaN`
- Some tweet ids have invalid `rating_numerator` and `rating_denominator`
- There are 137 duplicates


- The first issue listed under <b>Quality</b> has been dealt with when fixing the <b>Tidiness</b> issues

#### Define

- Replace every 'None' string in the dataset with `np.nan`

#### Code

In [52]:
# Check nulls before running the code
master_copy.isnull().sum()

tweet_id                0
timestamp               0
source                  0
text                    0
expanded_urls          58
rating_numerator        0
rating_denominator      0
name                    0
jpg_url               181
img_num               181
p1                    181
p1_conf               181
p1_dog                181
p2                    181
p2_conf               181
p2_dog                181
p3                    181
p3_conf               181
p3_dog                181
retweet_count           8
likes_count             8
dog_stage               0
dtype: int64

In [53]:
# Replace all string None in the dataset with np.nan 
master_copy.replace(to_replace="None", value=np.nan, inplace =True)

#### Test

In [54]:
# Check nulls after running the code
master_copy.isnull().sum()

tweet_id                 0
timestamp                0
source                   0
text                     0
expanded_urls           58
rating_numerator         0
rating_denominator       0
name                   680
jpg_url                181
img_num                181
p1                     181
p1_conf                181
p1_dog                 181
p2                     181
p2_conf                181
p2_dog                 181
p3                     181
p3_conf                181
p3_dog                 181
retweet_count            8
likes_count              8
dog_stage             1831
dtype: int64

In [55]:
master_copy.name.isnull().sum()

680
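
The invalid names noted under Quality ("a", "the", "O", "all", "old") are still counted as valid after this step, because only the string 'None' was replaced. A possible follow-up, sketched here on a hypothetical toy frame (`sample_names`), is to pass the whole list of invalid names to `Series.replace` for the `name` column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing valid names, placeholder words, and 'None'
sample_names = pd.DataFrame({
    "name": ["Phineas", "None", "a", "the", "Tilly"],
    "dog_stage": ["None", "pupper", "None", "doggo", "None"],
})

# Names flagged as invalid in the Quality assessment
invalid_names = ["a", "the", "None", "O", "all", "old"]

sample_names["name"] = sample_names["name"].replace(invalid_names, np.nan)
sample_names["dog_stage"] = sample_names["dog_stage"].replace("None", np.nan)

print(sample_names.isnull().sum())
```

Replacing per column, rather than across the whole frame, also avoids touching a tweet `text` that happens to equal one of these words.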