# 数据清理

- 数据整理，其中包括：
  - 收集数据
  - 评估数据：收集上述三个数据集之后，使用目测评估和编程评估的方式，对数据进行质量和清洁度的评估。在你的 wrangle_act.ipynb Jupyter Notebook 中记录评估过程和结果，最终列出至少 8 个质量问题 和 2 个清洁度问题。要符合项目规范，必须对项目动机中的要求进行评估（参见上一页课程的 关键要点 标题）。
  - 清洗数据：对你在评估时列出的每个问题进行清洗。在 wrangle_act.ipynb 展示清洗的过程。结果应该为一个优质干净整洁的主数据集（pandas DataFrame 类型） （如果都是以推特 ID 为观察对象的一些特征列，则清理最终只能有一个主数据集，如果有其他观察对象及其对应的特征字段，可以创建其他的数据集，同样需要清理）。同样地，必须符合项目动机的要点要求。
- 对清洗过的数据进行储存、分析和可视化：将清理后的数据集存储到 CSV 文件中，命名为 twitter_archive_master.csv。在 wrangle_act.ipynb Jupyter Notebook 中对清洗后的数据进行分析和可视化。必须生成至少 3 个见解和 1 个可视化。
- 书面报告 1) 你的数据整理工作 和 2) 你的数据分析和可视化：创建一个 300-600 字的书面报告，命名为 wrangle_report.pdf，在该报告中简要描述你的数据整理过程。这份报告可以看作是一份内部文档，供你的团队成员查看交流。创建一个 250 字以上的书面报告，命名为 act_report.pdf，在该报告中，你可以与读者交流观点，展示你使用整理过的数据生成的可视化图表。这份报告可以看作是一份外部文档，如博客帖子或杂志文章。

## 收集

In [213]:
import pandas as pd
import numpy as np
import json

In [214]:
tweets = pd.read_csv('twitter-archive-enhanced.csv')
predictions = pd.read_csv('image-predictions.tsv', sep='\t')
json = pd.read_json('tweet_json.json', lines=True)

In [4]:
#将json表格写入excel文件方便目测评估
writer = pd.ExcelWriter('json.xlsx')
json.to_excel(writer, 'json', index = False)
writer.save()

## 评估

In [7]:
# 显示表格tweets
tweets

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


`tweets`是WeRateDogs的推特档案，包括2356条含评分的推特的基本信息。原档案excel打开是2347行，而用pandas提取可以提取2356行。

In [215]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

tweet_id是整型，而不是字符串。  
timestamp是字符串，而不是日期格式。    
从retweeted_status_id、retweeted_status_user_id、retweeted_status_timestamp的信息可看出有181条转发数据。  

In [9]:
tweets.nunique()

tweet_id                      2356
in_reply_to_status_id           77
in_reply_to_user_id             31
timestamp                     2356
source                           4
text                          2356
retweeted_status_id            181
retweeted_status_user_id        25
retweeted_status_timestamp     181
expanded_urls                 2218
rating_numerator                40
rating_denominator              18
name                           957
doggo                            2
floofer                          2
pupper                           2
puppo                            2
dtype: int64

In [10]:
tweets.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [34]:
tweets[tweets.in_reply_to_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution,,,,,12,10,,,,,
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is reserved for dogs,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@Jack_Septic_Eye I'd need a few more pics to polish a full analysis, but based on the good boy content above I'm leaning towards 12/10",,,,,12,10,,,,,
149,863079547188785154,6.671522e+17,4.196984e+09,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Ladies and gentlemen... I found Pipsy. He may have changed his name to Pablo, but he never changed his love for the sea. Pupgraded to 14...",,,,https://twitter.com/dog_rates/status/863079547188785154/photo/1,14,10,,,,,
179,857214891891077121,8.571567e+17,1.806710e+08,2017-04-26 12:48:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@Marc_IRL pixelated af 12/10,,,,,12,10,,,,,
184,856526610513747968,8.558181e+17,4.196984e+09,2017-04-24 15:13:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY HI AFTER ALL. PUPGRADED TO A 14/10. WOULD BE AN HONOR TO FLY WITH https://t.co/p1hBHCmWnA",,,,https://twitter.com/dog_rates/status/856526610513747968/photo/1,14,10,,,,,
186,856288084350160898,8.562860e+17,2.792810e+08,2017-04-23 23:26:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@xianmcguire @Jenna_Marbles Kardashians wouldn't be famous if as a society we didn't place enormous value on what they do. The dogs are ...,,,,,14,10,,,,,
188,855862651834028034,8.558616e+17,1.943518e+08,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research,,,,,420,10,,,,,


In [38]:
tweets.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [39]:
tweets.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

分子有很多超过20的值，分母有很多不是10

In [14]:
tweets.name.value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
Oliver         11
Tucker         10
Lola           10
Penny          10
Bo              9
Winston         9
the             8
Sadie           8
Toby            7
an              7
Bailey          7
Daisy           7
Buddy           7
Koda            6
Oscar           6
Jax             6
Bella           6
Stanley         6
Scout           6
Leo             6
Jack            6
Dave            6
Rusty           6
Milo            6
Sammy           5
             ... 
Ferg            1
Mac             1
Kallie          1
Brandonald      1
Teddy           1
Miguel          1
Maya            1
Clybe           1
Grizzwald       1
Jockson         1
Comet           1
Sailor          1
Beebop          1
Kobe            1
Jessiga         1
Tyrus           1
Geoff           1
Barclay         1
Keet            1
Laika           1
Bobbay          1
light           1
Hero            1
Barney          1
Andru     

In [21]:
pd.set_option('max_colwidth', 140)
tweets[tweets.name== 'None'].text

5       Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWe...
7       When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nO...
12               Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm
24                                 You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV
25      This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https:...
30                                        @NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution
32                                                                                               RT @Athletics: 

In [29]:
tweets[tweets.name== 'None'].text.str.find('name').value_counts()

-1      731
 39       2
 26       2
 108      1
 63       1
 58       1
 46       1
 36       1
 33       1
 31       1
 20       1
 11       1
 4        1
Name: text, dtype: int64

In [30]:
tweets[tweets.name== 'an'].text.str.find('name').value_counts()

-1     6
 34    1
Name: text, dtype: int64

In [31]:
tweets[tweets.name== 'a'].text.str.find('name').value_counts()

-1     35
 30     4
 40     2
 33     2
 32     2
 48     1
 44     1
 42     1
 38     1
 37     1
 34     1
 31     1
 28     1
 27     1
 25     1
Name: text, dtype: int64

狗名有很多缺失值以及a和an这种无效值，这些缺失值和无效值对应的推文中有一部分可以提取出狗的名字。

In [12]:
tweets.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


None     2259
doggo      97
Name: doggo, dtype: int64

In [41]:
# 显示表格json
json

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'i...","{'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGK...",39492,False,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,...,0.0,,,,8842,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'i...","{'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DG...",33786,False,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/...",,...,0.0,,,,6480,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'i...","{'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'indices': [122, 145], 'media_url': 'http://pbs.twimg.com/media/DG...",25445,False,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/w...,,...,0.0,,,,4301,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'i...","{'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'indices': [80, 103], 'media_url': 'http://pbs.twimg.com/media/DF_...",42863,False,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,...,0.0,,,,8925,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': [129, 138]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 89132755194...","{'media': [{'id': 891327551943041024, 'id_str': '891327551943041024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DF...",41016,False,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWe...",,...,0.0,,,,9721,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
5,,,2017-07-29 00:08:17,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': [129, 138]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 89108794217...","{'media': [{'id': 891087942176911360, 'id_str': '891087942176911360', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DF...",20548,False,Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWe...,,...,0.0,,,,3240,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
6,,,2017-07-28 16:27:12,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/Zr4hWfAs1H', 'expanded_url': 'https://gofundme.com/y...","{'media': [{'id': 890971906207338496, 'id_str': '890971906207338496', 'indices': [141, 164], 'media_url': 'http://pbs.twimg.com/media/DF...",12053,False,Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4h...,,...,0.0,,,,2142,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
7,,,2017-07-28 00:22:40,"[0, 118]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 890729118844600320, 'id_str': '890729118844600320', 'i...","{'media': [{'id': 890729118844600320, 'id_str': '890729118844600320', 'indices': [119, 142], 'media_url': 'http://pbs.twimg.com/media/DF...",66596,False,When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nO...,,...,0.0,,,,19548,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
8,,,2017-07-27 16:25:51,"[0, 122]","{'hashtags': [{'text': 'BarkWeek', 'indices': [113, 122]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 89060917731...","{'media': [{'id': 890609177319665665, 'id_str': '890609177319665665', 'indices': [123, 146], 'media_url': 'http://pbs.twimg.com/media/DF...",28187,False,This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/...,,...,0.0,,,,4403,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
9,,,2017-07-26 15:59:51,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 890240245463175168, 'id_str': '890240245463175168', 'i...","{'media': [{'id': 890240245463175168, 'id_str': '890240245463175168', 'indices': [134, 157], 'media_url': 'http://pbs.twimg.com/media/DF...",32467,False,This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate ht...,,...,0.0,,,,7684,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."


`json`是通过推特API获取的补充数据，包含此前遗漏的转发数（retweet count）和喜爱数（favorite count）。

In [42]:
json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

contributors、coordinates和geo列为空。  
从retweeted_status可看出有177条转发数据。  
id_str是整型，而不是字符串。  
created_at是字符串，而不是日期格式。   

In [56]:
json[json.retweeted_status.notnull()]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
31,,,2017-07-15 02:45:48,"[0, 50]","{'hashtags': [{'text': 'BATP', 'indices': [21, 26]}], 'symbols': [], 'user_mentions': [{'screen_name': 'Athletics', 'name': 'A's, But Sp...",,0,False,RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo,,...,0.0,,8.860534e+17,8.860534e+17,106,False,"{'created_at': 'Sat Jul 15 02:44:07 +0000 2017', 'id': 886053734421102592, 'id_str': '886053734421102592', 'full_text': '12/10 #BATP htt...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
35,,,2017-07-13 01:35:06,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...","{'media': [{'id': 830583314243268608, 'id_str': '830583314243268608', 'indices': [110, 133], 'media_url': 'http://pbs.twimg.com/media/C4...",0,False,RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5,,...,0.0,,,,19188,False,"{'created_at': 'Sun Feb 12 01:04:29 +0000 2017', 'id': 830583320585068544, 'id_str': '830583320585068544', 'full_text': 'This is Lilly. ...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
67,,,2017-06-26 00:13:58,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...",,0,False,RT @dog_rates: This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https...,,...,,,,,7118,False,"{'created_at': 'Fri Jun 23 01:10:23 +0000 2017', 'id': 878057613040115712, 'id_str': '878057613040115712', 'full_text': 'This is Emmy. S...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
72,,,2017-06-24 00:09:53,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...",,0,False,"RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nht...",,...,,,,,1338,False,"{'created_at': 'Fri Jun 23 16:00:04 +0000 2017', 'id': 878281511006478336, 'id_str': '878281511006478336', 'full_text': 'Meet Shadow. In...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
73,,,2017-06-23 18:17:33,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...",,0,False,RT @dog_rates: Meet Terrance. He's being yelled at because he stapled the wrong stuff together. 11/10 hang in there Terrance https://t.c...,,...,,,,,6925,False,"{'created_at': 'Tue Nov 24 03:51:38 +0000 2015', 'id': 669000397445533696, 'id_str': '669000397445533696', 'full_text': 'Meet Terrance. ...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
77,,,2017-06-21 19:36:23,"[0, 122]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'rachel2195', 'name': 'Rachel Buikema', 'id': 512804507, 'id_str': '51...","{'media': [{'id': 876850756556607488, 'id_str': '876850756556607488', 'indices': [99, 122], 'media_url': 'http://pbs.twimg.com/media/DCs...",0,False,RT @rachel2195: @dog_rates the boyfriend and his soaking wet pupper h*cking love his new hat 14/10 https://t.co/dJx4Gzc50G,,...,0.0,,,,82,False,"{'created_at': 'Mon Jun 19 17:14:49 +0000 2017', 'id': 876850772322988033, 'id_str': '876850772322988033', 'full_text': '@dog_rates the ...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
90,,,2017-06-13 01:14:41,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...",,0,False,RT @dog_rates: This is Coco. At first I thought she was a cloud but clouds don't bork with such passion. 12/10 would hug softly https://...,,...,,,,,15442,False,"{'created_at': 'Sun May 21 16:48:45 +0000 2017', 'id': 866334964761202691, 'id_str': '866334964761202691', 'full_text': 'This is Coco. A...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
94,,,2017-06-11 00:25:14,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...","{'media': [{'id': 868880391209275392, 'id_str': '868880391209275392', 'indices': [114, 137], 'media_url': 'http://pbs.twimg.com/media/DA...",0,False,RT @dog_rates: This is Walter. He won't start hydrotherapy without his favorite floatie. 14/10 keep it pup Walter https://t.co/r28jFx9uyF,,...,0.0,,,,12435,False,"{'created_at': 'Sun May 28 17:23:24 +0000 2017', 'id': 868880397819494401, 'id_str': '868880397819494401', 'full_text': 'This is Walter....","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
96,,,2017-06-10 00:35:19,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': ...",,0,False,RT @dog_rates: This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nht...,,...,,,,,1656,False,"{'created_at': 'Fri Jun 09 16:22:42 +0000 2017', 'id': 873213775632977920, 'id_str': '873213775632977920', 'full_text': 'This is Sierra....","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
100,,,2017-06-08 04:17:07,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'loganamnosis', 'name': 'michael', 'id': 154767397, 'id_str': '1547673...",,0,False,"RT @loganamnosis: Penelope here is doing me quite a divertir. Well done, @dog_rates! Loving the pupdate. 14/10, je jouerais de nouveau. ...",,...,,,,,31,False,"{'created_at': 'Thu Jun 08 03:32:35 +0000 2017', 'id': 872657584259551233, 'id_str': '872657584259551233', 'full_text': 'Penelope here i...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."


In [58]:
json[json.quoted_status.notnull()]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
34,,,2017-07-13 15:19:09,"[0, 47]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/gzLHboL7Sk', 'expanded_url': 'https://twitter.com/4b...",,20739,False,I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk,,...,0.0,"{'created_at': 'Thu Jul 13 15:12:47 +0000 2017', 'id': 885517367337512960, 'id_str': '885517367337512960', 'full_text': '@dog_rates my g...",8.855174e+17,8.855174e+17,3876,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
41,,,2017-07-10 03:08:17,"[0, 104]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/uF3pQ8Wubj', 'expanded_url': 'https://twitter.com/ka...",,74192,False,OMG HE DIDN'T MEAN TO HE WAS JUST TRYING A LITTLE BARKOUR HE'S SUPER SORRY 13/10 WOULD FORGIVE IMMEDIATE https://t.co/uF3pQ8Wubj,,...,0.0,"{'created_at': 'Sun Jul 09 08:26:49 +0000 2017', 'id': 883965650754039809, 'id_str': '883965650754039809', 'full_text': 'Have you ever s...",8.839657e+17,8.839657e+17,21105,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
71,,,2017-06-24 13:24:20,"[0, 45]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/9uABQXgjwa', 'expanded_url': 'https://twitter.com/bb...",,30835,False,Martha is stunning how h*ckin dare you. 13/10 https://t.co/9uABQXgjwa,,...,0.0,"{'created_at': 'Sat Jun 24 13:05:06 +0000 2017', 'id': 878599868507402241, 'id_str': '878599868507402241', 'full_text': 'World's ugliest...",8.785999e+17,8.785999e+17,7510,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
82,,,2017-06-18 20:30:39,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/8yoc1CNTsu', 'expanded_url': 'https://twitter.com/mp...",,23789,False,I can say with the pupmost confidence that the doggos who assisted with this search are heroic as h*ck. 14/10 for all https://t.co/8yoc1...,,...,0.0,"{'created_at': 'Sat Jun 17 19:41:50 +0000 2017', 'id': 876162994446753793, 'id_str': '876162994446753793', 'full_text': 'These are the a...",8.76163e+17,8.76163e+17,4775,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
87,,,2017-06-14 21:06:43,"[0, 96]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/PFp4MghzBW', 'expanded_url': 'https://twitter.com/dr...",,27915,False,You'll get your package when that precious man is done appreciating the pups. 13/10 for everyone https://t.co/PFp4MghzBW,,...,0.0,"{'created_at': 'Mon Jun 12 23:49:34 +0000 2017', 'id': 874413398133547008, 'id_str': '874413398133547008', 'full_text': 'So this is why ...",8.744134e+17,8.744134e+17,6303,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
109,,,2017-06-03 20:33:19,"[0, 25]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/AbBLh2FZCH', 'expanded_url': 'https://twitter.com/an...",,21403,False,Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,...,0.0,"{'created_at': 'Sat Jun 03 18:46:59 +0000 2017', 'id': 871075758080503809, 'id_str': '871075758080503809', 'full_text': 'A dog's clever ...",8.710758e+17,8.710758e+17,5729,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
131,,,2017-05-22 18:21:28,"[0, 50]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/Q8mVwWN3f4', 'expanded_url': 'https://twitter.com/nb...",,20805,False,He was providing for his family 13/10 how dare you https://t.co/Q8mVwWN3f4,,...,0.0,"{'created_at': 'Mon May 22 01:00:31 +0000 2017', 'id': 866458718883467265, 'id_str': '866458718883467265', 'full_text': 'Suspect collare...",8.664587e+17,8.664587e+17,5139,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
187,,,2017-04-22 18:55:51,"[0, 89]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/sb73bV5Y7S', 'expanded_url': 'https://twitter.com/pe...",,12449,False,"HE'S LIKE ""WAIT A MINUTE I'M AN ANIMAL THIS IS AMAZING HI HUMAN I LOVE YOU AS WELL"" 13/10 https://t.co/sb73bV5Y7S",,...,0.0,"{'created_at': 'Sat Apr 22 18:54:20 +0000 2017', 'id': 855857318168150016, 'id_str': '855857318168150016', 'full_text': 'They're good do...",8.558573e+17,8.558573e+17,2299,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
189,,,2017-04-22 16:18:34,"[0, 110]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/5BEjzT2Tth', 'expanded_url': 'https://twitter.com/ma...",,27952,False,I HEARD HE TIED HIS OWN BOWTIE MARK AND HE JUST WANTS TO SAY HI AND MAYBE A NOGGIN PAT SHOW SOME RESPECT 13/10 https://t.co/5BEjzT2Tth,,...,0.0,"{'created_at': 'Sat Apr 22 05:36:05 +0000 2017', 'id': 855656431005061120, 'id_str': '855656431005061120', 'full_text': 'Seriously, @del...",8.556564e+17,8.556564e+17,5905,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
238,,,2017-03-27 23:35:28,"[0, 66]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/GJ8rozumsy', 'expanded_url': 'https://twitter.com/sh...",,15250,False,THIS WAS NOT HIS FAULT HE HAD NO IDEA. 11/10 STILL A VERY GOOD DOG https://t.co/GJ8rozumsy,,...,0.0,"{'created_at': 'Mon Mar 27 22:11:17 +0000 2017', 'id': 846484798663245829, 'id_str': '846484798663245829', 'full_text': 'Dog Shipped to ...",8.464848e+17,8.464848e+17,3468,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."


In [51]:
json.favorited.value_counts()

False    2352
Name: favorited, dtype: int64

In [52]:
json.retweeted.value_counts()

False    2352
Name: retweeted, dtype: int64

In [54]:
json.truncated.value_counts()

False    2352
Name: truncated, dtype: int64

favorited、retweeted和truncated三列的值均为false

In [46]:
json.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2217
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [48]:
json[json.in_reply_to_status_id.notnull()]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
29,,,2017-07-15 16:51:35,"[27, 105]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'NonWhiteHat', 'name': 'Patrick Nonwhite', 'id': 2281181600, 'id_str':...",,117,False,@NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution,,...,,,,,4,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
54,,,2017-07-02 21:58:53,"[13, 91]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'roushfenway', 'name': 'Roush Fenway Racing', 'id': 47384430, 'id_str'...",,129,False,@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s,,...,,,,,7,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
63,,,2017-06-27 12:14:36,"[16, 31]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'RealKentMurphy', 'name': 'Kent', 'id': 3105440746, 'id_str': '3105440...",,313,False,@RealKentMurphy 14/10 confirmed,,...,,,,,10,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
112,,,2017-06-02 19:38:25,"[30, 60]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'ComplicitOwl', 'name': 'Derek', 'id': 16487760, 'id_str': '16487760',...",,120,False,@ComplicitOwl @ShopWeRateDogs &gt;10/10 is reserved for dogs,,...,,,,,3,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
146,,,2017-05-13 16:15:35,"[17, 134]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Jack_Septic_Eye', 'name': 'Jacksepticeye', 'id': 77596200, 'id_str': ...",,2349,False,"@Jack_Septic_Eye I'd need a few more pics to polish a full analysis, but based on the good boy content above I'm leaning towards 12/10",,...,,,,,105,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
147,,,2017-05-12 17:12:53,"[0, 139]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 863079538779013120, 'id_str': '863079538779013120', 'i...","{'media': [{'id': 863079538779013120, 'id_str': '863079538779013120', 'indices': [140, 163], 'media_url': 'http://pbs.twimg.com/media/C_...",9068,False,"Ladies and gentlemen... I found Pipsy. He may have changed his name to Pablo, but he never changed his love for the sea. Pupgraded to 14...",,...,0.0,,,,1188,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
176,,,2017-04-26 12:48:51,"[10, 28]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Marc_IRL', 'name': 'Marc Watson', 'id': 180670967, 'id_str': '1806709...",,242,False,@Marc_IRL pixelated af 12/10,,...,,,,,20,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
181,,,2017-04-24 15:13:52,"[0, 112]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 856526604033556482, 'id_str': '856526604033556482', 'i...","{'media': [{'id': 856526604033556482, 'id_str': '856526604033556482', 'indices': [113, 136], 'media_url': 'http://pbs.twimg.com/media/C-...",12412,False,"THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY HI AFTER ALL. PUPGRADED TO A 14/10. WOULD BE AN HONOR TO FLY WITH https://t.co/p1hBHCmWnA",,...,0.0,,,,2053,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
183,,,2017-04-23 23:26:03,"[28, 165]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'xianmcguire', 'name': 'Christian McGuire', 'id': 279280991, 'id_str':...",,540,False,@xianmcguire @Jenna_Marbles Kardashians wouldn't be famous if as a society we didn't place enormous value on what they do. The dogs are ...,,...,,,,,17,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."
185,,,2017-04-22 19:15:32,"[14, 86]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dhmontgomery', 'name': 'David Montgomery', 'id': 194351775, 'id_str':...",,354,False,@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research,,...,,,,,28,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WIL..."


In [59]:
json.favorite_count.value_counts()

0        177
1753       3
3548       3
689        3
1526       3
520        3
465        3
171        3
3508       3
343        3
2417       3
3217       2
3846       2
1501       2
2452       2
262        2
195        2
242        2
2616       2
2250       2
3221       2
1085       2
1187       2
1722       2
248        2
1124       2
14506      2
2231       2
1861       2
5377       2
        ... 
4715       1
23100      1
2644       1
8769       1
1671       1
39492      1
1498       1
5773       1
10824      1
1046       1
21069      1
35406      1
4687       1
2381       1
31314      1
8575       1
17001      1
12887      1
2648       1
4697       1
4699       1
2652       1
6750       1
8799       1
2656       1
10852      1
5878       1
14950      1
6760       1
15858      1
Name: favorite_count, Length: 2023, dtype: int64

In [60]:
json.retweet_count.value_counts()

1280     5
312      5
745      5
1554     4
1103     4
1201     4
37       4
61       4
606      4
680      4
701      4
182      4
6925     4
8471     4
468      4
252      3
1873     3
2690     3
71       3
118      3
1036     3
516      3
617      3
2142     3
280      3
263      3
521      3
1084     3
698      3
985      3
        ..
2482     1
4533     1
4535     1
441      1
2490     1
445      1
4479     1
377      1
325      1
2422     1
329      1
333      1
8527     1
6480     1
10580    1
345      1
347      1
2400     1
4449     1
705      1
6500     1
357      1
6504     1
361      1
6506     1
367      1
4465     1
2418     1
2420     1
0        1
Name: retweet_count, Length: 1752, dtype: int64

In [61]:
# 显示表格 predictions
predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [62]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


tweet_id是整型，而不是字符串。  
created_at是字符串，而不是日期格式。  

In [10]:
predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [126]:
#检查tweets表格里的tweet_id是否都存在于predictions表格里
len(tweets[~tweets.tweet_id.isin(predictions.tweet_id)])

281

In [128]:
#检查tpredictions表格里的tweet_id是否都存在于tweets表格里
len(predictions[~predictions.tweet_id.isin(tweets.tweet_id)])

0

`tweets` 表格中有281条推特记录中的*tweet_id*在 `predictions` 表格中不存在。

#### 质量
##### `tweets` 表格

- tweet_id是整型，而不是字符串
- 有281条推特记录中的*tweet_id*在 `predictions` 表格中不存在。
- 含有181条转发数据
- 分子有很多超过20的值
- 分母有很多不是10
- 狗狗姓名提取不完整、不准确
- *in_reply_to_status_id*、*in_reply_to_user_id*、*source*、*expanded_urls*列与分析内容不太相关


##### `json` 表格
- id_str是整型，而不是字符串
- 含有177条转发数据
- *contributors*、*coordinates*和*geo*列为空
- *favorited*、*retweeted*和*truncated*三列的值均为false

##### `predictions` 表格
- tweet_id是整型，而不是字符串



#### 清洁度
-  `tweets` 表格中的*doggo*、*floofer*、*pupper*和*puppo*四列应合并为一个变量
-  `tweets` 表格、`json` 表格和`predictions`表格 应合并为一个表格  
  
## 清理

In [235]:
tweets_clean = tweets.copy()
json_clean = json.copy()
predictions_clean = predictions.copy()




### 清洁度

#### `tweets` 表格中的*doggo*、*floofer*、*pupper*和*puppo*四列应合并为一个变量：狗狗地位stage

#### 定义  
- 将*doggo*、*floofer*、*pupper*和*puppo*四列的值合并放入stage列，将合并值替换为正常值后删除*doggo*、*floofer*、*pupper*和*puppo*这四列。

#### 代码

In [236]:
tweets_clean['stage'] = tweets_clean['doggo'] + tweets_clean['floofer'] + tweets_clean['pupper'] + tweets_clean['puppo']
tweets_clean['stage'].replace('NoneNoneNoneNone', np.NaN, inplace=True)
tweets_clean['stage'].replace('doggoNoneNoneNone','doggo', inplace=True)
tweets_clean['stage'].replace('NoneflooferNoneNone','floofer', inplace=True)
tweets_clean['stage'].replace('NoneNonepupperNone','pupper', inplace=True)
tweets_clean['stage'].replace('NoneNoneNonepuppo','puppo', inplace=True)
tweets_clean.drop(['doggo','floofer','pupper','puppo'], axis=1, inplace=True)

In [237]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [238]:
tweets_clean.stage.value_counts()

pupper                  245
doggo                    83
puppo                    29
doggoNonepupperNone      12
floofer                   9
doggoNoneNonepuppo        1
doggoflooferNoneNone      1
Name: stage, dtype: int64

In [239]:
pd.set_option('max_colwidth', 140)
tweets_clean[tweets_clean.stage == 'doggoNoneNonepuppo'].text

191    Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for h...
Name: text, dtype: object

In [240]:
pd.set_option('max_colwidth', 140)
tweets_clean[tweets_clean.stage == 'doggoflooferNoneNone'].text

200    At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send ...
Name: text, dtype: object

In [241]:
pd.set_option('max_colwidth', 140)
tweets_clean[tweets_clean.stage == 'doggoNonepupperNone'].text

460     This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodo...
531     Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/AN...
565                                                                           Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze
575     This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55...
705     This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautio...
733                                                                                      Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u
778                                                       RT @dog_rates: Like father (doggo), like son (pupper).

In [242]:
pd.set_option('max_colwidth', 140)
tweets_clean[tweets_clean.stage.isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,stage
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/...",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/w...,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWe...",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWe...,,,,https://twitter.com/dog_rates/status/891087950875897856/photo/1,13,10,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1",13,10,Jax,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nO...,,,,"https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1",13,10,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/...,,,,https://twitter.com/dog_rates/status/890609185150312448/photo/1,13,10,Zoey,
10,890006608113172480,,,2017-07-26 00:31:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Koda. He is a South Australian deckshark. Deceptively deadly. Frighteningly majestic. 13/10 would risk a petting #BarkWeek https...,,,,"https://twitter.com/dog_rates/status/890006608113172480/photo/1,https://twitter.com/dog_rates/status/890006608113172480/photo/1",13,10,Koda,


#### 测试

In [243]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
stage                         380 non-null object
dtypes: float64(4), int64(3), object(7)
memory usage: 257.8+ KB


### 质量

#### `tweets` 表格包含181条转发数据

#### 定义  
- 删除181条转发数据，即只保留*retweeted_status_id*为空值的行

#### 代码

In [244]:
tweets_clean = tweets_clean[tweets_clean.retweeted_status_id.isnull()]

#### 测试

In [245]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2175 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
stage                         344 non-null object
dtypes: float64(4), int64(3), object(7)
memory usage: 254.9+ KB


#### `tweets` 表格*in_reply_to_status_id*、*in_reply_to_user_id*、*source*、*expanded_urls*列以及与retweeted相关的列无用

#### 定义  
- 删除*in_reply_to_status_id*、*in_reply_to_user_id*、*source*、*expanded_urls*列和retweeted相关的列

#### 代码

In [246]:
tweets_clean = tweets_clean[['tweet_id', 'timestamp', 'text', 'rating_numerator', 'rating_denominator', 'name', 'stage']]

#### 测试

In [247]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      2175 non-null int64
rating_denominator    2175 non-null int64
name                  2175 non-null object
stage                 344 non-null object
dtypes: int64(3), object(4)
memory usage: 135.9+ KB


#### `tweets` 表格狗狗评分的分子和分母不准确

#### 定义  
- 用正则表达式重新提取狗狗评分的分子和分母

#### 代码

In [248]:
#查看分母是10的倍数的推特原文
tweets_clean[tweets_clean['rating_denominator'].isin([20,40,50,70,80,90,110,120,130,150,170])].text

433                                             The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
902                                                                  Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
1120                      Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
1165                                                                               Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
1202                          This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
1228                                                  Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
1254                                   Here's a brigade of puppers. All look very prepared for whatever happens 

In [249]:
#查看分母不是10的倍数的推特原文
tweets_clean[tweets_clean['rating_denominator'].isin([2,7,11,15,16])].text

342                                                                                                        @docmisterio account started on 11/15/15
516     Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \r\nKeep Sam smiling by clicking and sharing this link:\r\nhttps://t....
1068    After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDND...
1662    This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5
1663    I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible
2335       This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
Name: text, dtype: object

分母是10的n倍代表对n只狗进行评分，分母不是10的倍数则说明数据提取错误

In [250]:
tweets_clean['rating_numerator']=tweets_clean['rating_numerator']

In [251]:
tweets_clean['rating_denominator']=tweets_clean['rating_denominator']
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      2175 non-null int64
rating_denominator    2175 non-null int64
name                  2175 non-null object
stage                 344 non-null object
dtypes: int64(3), object(4)
memory usage: 135.9+ KB


In [252]:
#tweets_clean.rating_numerator = tweets_clean.text.str.extract('([1-9]+)(?:/[1-9]+0)', expand=False).values

In [254]:
tweets_clean.rating_numerator = tweets_clean.rating_numerator.astype(float)
tweets_clean.rating_denominator = tweets_clean.rating_denominator.astype(float)
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      2175 non-null float64
rating_denominator    2175 non-null float64
name                  2175 non-null object
stage                 344 non-null object
dtypes: float64(2), int64(1), object(4)
memory usage: 135.9+ KB


In [259]:
tweets_clean.rating_numerator = tweets_clean.text.str.extract('([1-9]*.?[1-9]+)(?:/[1-9]+0)', expand=False).values
tweets_clean.rating_denominator = tweets_clean.text.str.extract('(?:[1-9]*.?[1-9]+/)([1-9]+0)', expand=False).values


In [260]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      1733 non-null object
rating_denominator    1733 non-null object
name                  2175 non-null object
stage                 344 non-null object
dtypes: int64(1), object(6)
memory usage: 135.9+ KB


In [262]:
tweets_clean.rating_numerator = tweets_clean.rating_numerator.fillna(0.0).astype()
tweets_clean.rating_denominator = tweets_clean.rating_denominator.fillna(0.0)

In [265]:
tweets_clean.rating_denominator.value_counts()

10     1721
0.0     442
20        2
80        1
120       1
50        1
150       1
70        1
110       1
40        1
130       1
170       1
90        1
Name: rating_denominator, dtype: int64

In [266]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      2175 non-null object
rating_denominator    2175 non-null object
name                  2175 non-null object
stage                 344 non-null object
dtypes: int64(1), object(6)
memory usage: 135.9+ KB


In [264]:
tweets_clean.rating_numerator.value_counts()

 12      491
0.0      442
 11      420
 13      299
 9       151
 8        97
 7        56
 14       44
 5        34
 6        32
 4        19
 3        19
 2         8
 1         7
13         7
12         7
.11        4
.9         3
11         3
9          2
.12        2
5          1
204        1
;2         1
 44        1
 1776      1
 45        1
13.5       1
11.26      1
3 13       1
 182       1
 99        1
 165       1
 144       1
 15        1
(8         1
.13        1
 143       1
 666       1
 121       1
11.27      1
 17        1
.8         1
9.5        1
 88        1
6          1
07         1
 84        1
9.75       1
Name: rating_numerator, dtype: int64

In [229]:
#查看分子值较为异常的推特原文
tweets_clean[tweets_clean['rating_numerator'].isin([26,27,44,45,144,182,121,165,666,99,75,88,1176,143])].text

189           @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
290                                                                                                                              @markhoppus 182/10
340     RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO...
695                This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
763     This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile bac...
902                                                                  Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
1228                                                  Happy Saturday here's 9 puppers on a bench. 99/90 good wor

In [202]:
#查看分子值较为异常的推特原文
tweets_clean[tweets_clean['rating_numerator'].isin([;2,27,44,45,144,182,121,165,666,99,75,88,1176,143])].text

0                                     This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1       This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/...
2       This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/w...
3                                           This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4       This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWe...
5       Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWe...
6       Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by cli

#### 测试

#### `tweets` 表格狗狗姓名提取不完整、不准确

#### 定义  
- 用正则表达式重新提取狗狗姓名

#### 代码

In [117]:
tweets_clean.name = tweets_clean.text.str.extract('(?:This is|Meet|name is|Say hello to|named) ([A-Z][a-z]{2,15})', expand=False)

#### 测试

In [121]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 7 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
text                  2175 non-null object
rating_numerator      2175 non-null int64
rating_denominator    2175 non-null int64
name                  1399 non-null object
stage                 344 non-null object
dtypes: int64(3), object(4)
memory usage: 135.9+ KB


In [120]:
tweets_clean['name'].value_counts()

Charlie      11
Lucy         11
Oliver       10
Cooper       10
Penny         9
Tucker        9
Winston       8
Lola          8
Sadie         8
Toby          7
Daisy         7
Jax           6
Bella         6
Bailey        6
Oscar         6
Stanley       6
Koda          6
Milo          5
Buddy         5
Zoey          5
Chester       5
Louis         5
Bentley       5
Rusty         5
Scout         5
Leo           5
Gus           4
Sophie        4
Phil          4
Cassie        4
             ..
Sweets        1
Ron           1
Richie        1
Rodman        1
Sephie        1
Alfy          1
Eriq          1
Stephanus     1
Pavlov        1
Cherokee      1
Quinn         1
Arnold        1
Goliath       1
Combo         1
Mya           1
Harry         1
Flash         1
Strider       1
Bronte        1
Major         1
Rizzo         1
Rey           1
Sundance      1
Stormy        1
Buddah        1
Ralphie       1
Ronnie        1
Bobble        1
Joshwa        1
Meatball      1
Name: name, Length: 941,


#### `json` 表格包含177条转发数据

#### 定义  
- 删除177条转发数据，即只保留*retweeted_status*为空值的行

#### 代码

In [88]:
json_clean = json_clean[json_clean.retweeted_status.isnull()]

#### 测试

In [131]:
json_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2351
Data columns (total 3 columns):
id_str            2175 non-null int64
favorite_count    2175 non-null int64
retweet_count     2175 non-null int64
dtypes: int64(3)
memory usage: 68.0 KB




#### `json` 表格包含大量无用数据

#### 定义  
- 只保留*id_str*, *favorite_count*, *retweeted_status*列的数据

#### 代码

In [90]:
json_clean = json_clean[['id_str', 'favorite_count', 'retweet_count']]

#### 测试

In [92]:
json_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2351
Data columns (total 3 columns):
id_str            2175 non-null int64
favorite_count    2175 non-null int64
retweet_count     2175 non-null int64
dtypes: int64(3)
memory usage: 68.0 KB


### 清洁度

#### `tweets_clean` 表格、`json_clean` 表格和`predictions_clean` 表格应合并为一个表格

#### 定义  
- 将 `tweets_clean` 、`json_clean` 和`predictions_clean` 3个表格合并到`twitter_archive_master`表格，按照tweets_id和id_str进行合并 

#### 代码

#### 测试

## 保存

In [None]:
twitter_archive_master.to_csv('twitter_archive_master.csv', index: False)

## 分析和可视化