# Project 2: Data Preparation and Analysis

## Outline
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Gathering the data</a>
    <ul>
        <li><a href="#readDAta">Reading the data</a></li>
        <li><a href="#programmingDownload">Programmatic download of the data</a></li>
        <li><a href="#dataFromTwitterApi">Retrieving additional data from the Twitter API</a></li>
    </ul>
</li>
<li><a href="#eda">Assessing the data</a></li>
<li><a href="#eda">Cleaning the data</a></li>
<li><a href="#conclusions">Analyzing the data</a></li>
<li><a href="#conclusions">Difficulties encountered</a></li>
<li><a href="#conclusions">Conclusion</a></li>
<li><a href="#conclusions">References</a></li>
</ul>

Step 1: gathering the data

Step 2: assessing the data

Step 3: cleaning the data

Step 4: storing the data

Step 5: analyzing and visualizing the data



<a id='intro'></a>
## Introduction

In data work, the datasets we handle are rarely clean. It is our responsibility to assess and clean them so that the analyses and visualizations built on them are easier to carry out. The main goal of this project is therefore to practice data assessment and data cleaning.<br> Using Python and its libraries, we will gather data from several sources and in several formats, assess its quality and tidiness, and then clean it.

The dataset we will wrangle, analyze, and visualize is the tweet archive of the Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. WeRateDogs has more than 4 million followers and has received international media coverage.

We have the WeRateDogs Twitter archive, which contains basic data (tweet ID, timestamp, text, etc.) for the account's more than 5,000 tweets.

<a id='wrangling'></a>
## Gathering the data
This step sets up our workspace:
<ul>
    <li>Declare the dependencies (import the libraries and modules needed) for this project;</li>
    <li>Import (read) the working file(s).</li>
</ul>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import os
import tweepy
import configparser  # Library used to keep credentials out of the notebook
import json
from timeit import default_timer as timer
import ast
import re

In [2]:
# Magic command to display plots inline in the notebook
%matplotlib inline

In [3]:
# Set up pandas display options
pd.set_option('display.max_columns', 21) # Raise the default limit on displayed columns.
pd.set_option('display.max_rows', 20) # Set the default limit on displayed rows.
pd.options.mode.chained_assignment = None  # Disable the SettingWithCopyWarning

<a id='readDAta'></a>
####  1.1. The WeRateDogs Twitter archive
Here we load the file provided to us into a DataFrame using the pandas function **read_csv()**.

In [4]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


<a id='programmingDownload'></a>
####  1.2. Tweet image predictions: programmatic download
Using the <b>requests</b> library, we download and save the file <b>image-predictions.tsv</b>, which contains information such as the URLs of the dog photos.

In [5]:
'''
# Download image-predictions.tsv from its URL (this cell is disabled by the surrounding triple quotes)
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # Stop here if the download failed
# Save the file under the last component of the URL: image-predictions.tsv
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)
'''

<a id='dataFromTwitterApi'></a>
####  1.3. Additional data: using the Twitter API
We now need to retrieve additional data, namely each tweet's retweet count and favorite ("like") count, using the Twitter API. We proceed as follows:
<ul>
    <li>Configure the Twitter API</li>
    <li>Get the list of all tweet identifiers (tweet_id) contained in <b>twitter_archive</b></li>
    <li>Use the API to retrieve the additional data</li>
    <li>If a download succeeds, store the data as a dictionary in the list <b>tweets_list</b></li>
    <li>If a download fails, record its tweet_id in <b>error_tweets_list</b></li>
</ul>

In [6]:
# Twitter API configuration
config = configparser.RawConfigParser()
config.read('config.ini')

api_key = config['TWITTER']['api_key']
api_key_secret = config['TWITTER']['api_key_secret']

access_token = config['TWITTER']['access_token']
access_token_secret = config['TWITTER']['access_token_secret']

bearer_token = config['TWITTER']['bearer_token']
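The cell above expects a `config.ini` file with a `[TWITTER]` section holding five keys. As a minimal sketch of that layout, the snippet below writes a template with placeholder values to a temporary folder (an assumption made here so that no real `config.ini` is overwritten) and reads it back the same way the notebook does. The placeholder strings are illustrative only.

```python
import configparser
import os
import tempfile

# Placeholder values only: real credentials come from your own Twitter developer account.
template = configparser.RawConfigParser()
template["TWITTER"] = {
    "api_key": "YOUR_API_KEY",
    "api_key_secret": "YOUR_API_KEY_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
    "bearer_token": "YOUR_BEARER_TOKEN",
}

# Write the template to a temporary folder so that no existing config.ini is touched.
config_path = os.path.join(tempfile.mkdtemp(), "config.ini")
with open(config_path, "w", encoding="utf-8") as config_file:
    template.write(config_file)

# Read it back exactly as the notebook cell does.
config = configparser.RawConfigParser()
config.read(config_path)
print(config["TWITTER"]["api_key"])
```

Keeping this file out of version control is what actually keeps the credentials private; the parser only separates them from the notebook.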

In [7]:
# Twitter API Authentication
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)

# Create the API instance
api = tweepy.API(auth)

In [8]:
# Get the list of all tweet_id values in twitter_archive
tweet_id_list = list(twitter_archive['tweet_id'])
# Display the number of tweet IDs in the list
print(len(tweet_id_list))

2356


In [9]:
'''
# Get additional data from the Twitter API

tweets_list = []
error_tweets_list = []

start = timer()
for tweets_count, tweet_id in enumerate(tweet_id_list, start=1):
    print('{0} : {1} fetching data...'.format(tweets_count, tweet_id))

    try:
        tweet_data = api.get_status(tweet_id, tweet_mode="extended")
        tweet = {
            'tweet_id': str(tweet_id),
            'retweet_count': tweet_data._json['retweet_count'],
            'favorite_count': tweet_data._json['favorite_count'],
        }
        tweets_list.append(tweet)

        print(tweet)
        print('Done')

    except Exception:
        # Record the ID of any tweet that could not be retrieved
        error_tweets_list.append(tweet_id)
        print('Failed')

    print('_______________________________________')

end = timer()
duration = end - start

# Convert the runtime to hours, minutes, and seconds
sec = duration % (24 * 3600)
hour = sec // 3600
sec %= 3600
minutes = sec // 60
sec %= 60

# Performance report
print("Duration : %02d:%02d:%02d" % (hour, minutes, sec))
print('Total of successes : ', len(tweets_list))
print('Total of failures :', len(error_tweets_list))
print(tweets_list)
print(error_tweets_list)
'''

In [10]:
'''
# write additional data into tweet_json.txt
file_name = 'tweet_json.txt'
with open(file_name, 'w', encoding='utf-8') as file:
    for tweet in tweets_list:
        json.dump(tweet, file) 
        file.write("\n")
'''

In [11]:
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')


In [12]:
'''
# Don't consider this cell
tweets_json = pd.read_csv('tweet_json.txt', sep='|')
'''

In [13]:
tweet_json = pd.DataFrame(pd.read_csv('tweet_json.txt',sep='|',header=None).iloc[:,0].apply(ast.literal_eval).tolist())
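Since `tweet_json.txt` holds one JSON object per line, it can also be parsed with the `json` module directly. The sketch below is one possible alternative to the `read_csv` plus `ast.literal_eval` combination, not the notebook's own method. It writes a small sample file to a temporary folder first; the file location and the retweet and favorite counts are made-up values for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up sample records with the same three keys as tweet_json.txt
sample_tweets = [
    {"tweet_id": "892420643555336193", "retweet_count": 100, "favorite_count": 500},
    {"tweet_id": "892177421306343426", "retweet_count": 80, "favorite_count": 400},
]

# Write one JSON object per line, as the notebook does for tweet_json.txt
file_name = os.path.join(tempfile.mkdtemp(), "tweet_json_sample.txt")
with open(file_name, "w", encoding="utf-8") as file:
    for tweet in sample_tweets:
        json.dump(tweet, file)
        file.write("\n")

# Parse each non-empty line back into a dictionary, then build the DataFrame
with open(file_name, encoding="utf-8") as file:
    records = [json.loads(line) for line in file if line.strip()]

tweet_json_demo = pd.DataFrame(records)
print(tweet_json_demo)
```

Because `tweet_id` was saved as a string, it stays a string after parsing, which matches the `object` dtype reported for this column.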

### Assessing the data

<a id='readDAta'></a>
####  2.1. Visual assessment

In [14]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


<a id='readDAta'></a>
####  2.2. Programmatic assessment

In [15]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [16]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [17]:
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2325 entries, 0 to 2324
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2325 non-null   object
 1   retweet_count   2325 non-null   int64 
 2   favorite_count  2325 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 54.6+ KB


In [18]:
def getDuplicatedRows(dataframe):
    return ("Total duplicated rows = {0}".format(len(dataframe[dataframe.duplicated(subset=['tweet_id'])])))

In [19]:
# Verify duplicated rows
twitter_archive_duplicated_rows = getDuplicatedRows(twitter_archive)
image_predictions_duplicated_rows = getDuplicatedRows(image_predictions)
tweet_json_duplicated_rows = getDuplicatedRows(tweet_json)
print('twitter_archive :', twitter_archive_duplicated_rows)
print('image_predictions :', image_predictions_duplicated_rows)
print('tweet_json :', tweet_json_duplicated_rows)

twitter_archive : Total duplicated rows = 0
image_predictions : Total duplicated rows = 0
tweet_json : Total duplicated rows = 0


In [20]:
# Get the unique values of each column in a given list of columns for a dataframe
def getUniqueValues(dataframe, columns_list):
    list_of_colums_unique_values= []
    for col in columns_list:
        column_unique_values = {}
        column_list_of_unique_values = list(dataframe[col].unique())
        column_unique_values[col] = column_list_of_unique_values
        list_of_colums_unique_values.append(column_unique_values)
    return list_of_colums_unique_values

In [21]:
getUniqueValues(twitter_archive, ['rating_numerator', 'rating_denominator', 'doggo', 'floofer', 'pupper', 'puppo'])

[{'rating_numerator': [13,
   12,
   14,
   5,
   17,
   11,
   10,
   420,
   666,
   6,
   15,
   182,
   960,
   0,
   75,
   7,
   84,
   9,
   24,
   8,
   1,
   27,
   3,
   4,
   165,
   1776,
   204,
   50,
   99,
   80,
   45,
   60,
   44,
   143,
   121,
   20,
   26,
   2,
   144,
   88]},
 {'rating_denominator': [10,
   0,
   15,
   70,
   7,
   11,
   150,
   170,
   20,
   50,
   90,
   80,
   40,
   130,
   110,
   16,
   120,
   2]},
 {'doggo': ['None', 'doggo']},
 {'floofer': ['None', 'floofer']},
 {'pupper': ['None', 'pupper']},
 {'puppo': ['None', 'puppo']}]

In [22]:
getUniqueValues(image_predictions, ['img_num', 'p1_dog', 'p2_dog', 'p3_dog'])

[{'img_num': [1, 4, 2, 3]},
 {'p1_dog': [True, False]},
 {'p2_dog': [True, False]},
 {'p3_dog': [True, False]}]

In [72]:
# Get the value counts of each column in a given list of columns for a dataframe
def getValueCounts(dataframe, columns):
    list_of_col = []
    for col in columns:
        col_value_counts = {}
        value_counts = dataframe[col].value_counts()
        col_value_counts[col] = value_counts
        list_of_col.append(col_value_counts)
    return list_of_col
        

In [75]:
twitter_archive_value_counts = getValueCounts(twitter_archive, ['rating_numerator', 'rating_denominator'])
print(twitter_archive_value_counts)

[{'rating_numerator': 12     558
11     464
10     461
13     351
9      158
      ... 
27       1
45       1
99       1
121      1
204      1
Name: rating_numerator, Length: 40, dtype: int64}, {'rating_denominator': 10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64}]


### Quality issues
##### `twitter_archive` table

1. Some `rating_numerator` values are valid (they are numbers) but inaccurate (outside the expected range). For example, row 190 (tweet_id = 855862651834028034) has a rating of 420;

2. Some dogs' ratings are valid but inaccurate (incorrect). For example, row 1845 (DataFrame index 1843, tweet_id = 675853064436391936) has a rating of 88/80;

3. Some values in the `text` column contain two ratings, and the first one, which is the one extracted, is often wrong. For example, row 315 (DataFrame index 313, tweet_id = 835246439529840640) contains two ratings: the first, 960/00, is wrong, and the second, 13/10, is the correct one;

4. Some ratings have a decimal part, so that when they are extracted only the digits after the decimal point end up in the numerator. For example, row 342 (DataFrame index 340, tweet_id = 832215909146226688) has a rating of 9.75/10 in the `text` column, whereas the `rating_numerator` column only holds the decimal part, 75;

5. Some denominator values differ from 10, for example 0, 7, or 170;

6. The `tweet_id` column is of type `int64` instead of `object`;

7. The `in_reply_to_status_id` column is of type `float64` instead of `object`;

8. The `in_reply_to_user_id` column is of type `float64` instead of `object`;

9. The `timestamp` column is of type `object` instead of `datetime`;

10. The `retweeted_status_timestamp` column is of type `object` instead of `datetime`;
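The rating issues listed above (two ratings in one text, decimal numerators) can be illustrated with a small regular-expression sketch. The pattern and the helper `extract_last_rating` are hypothetical names introduced here, and the sample texts are made up; keeping the last rating found is a heuristic, not a guarantee that it is the right one.

```python
import re

# One rating: a numerator with an optional decimal part, a slash, then an integer denominator
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_last_rating(text):
    """Return (numerator, denominator) for the last rating found in text, or None."""
    matches = RATING_PATTERN.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    return float(numerator), int(denominator)


# Made-up texts that mimic the issues described in the list
examples = [
    "Solid effort. 9.75/10",
    "Not 960/00 at all, more like 13/10",
    "No rating in this one",
]

for text in examples:
    print(text, "->", extract_last_rating(text))
```

Allowing an optional decimal part in the numerator keeps `9.75` whole instead of capturing only `75`.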

##### `image_predictions` table
11. The `tweet_id` column is of type `int64` instead of `object`.

### Tidiness issues

1. `tweet_json` should be part of the `twitter_archive` table;
2. Not all posts in the `twitter_archive` table are original tweets: some are retweets or replies to other tweets. These are the posts that have values in the `in_reply_to_status_id` and `in_reply_to_user_id` columns, or in the `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns.
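Both tidiness issues can be sketched on toy data: keep only the rows whose reply and retweet columns are empty, then merge the counts on `tweet_id`. The frames `archive` and `counts` below are made-up stand-ins for `twitter_archive` and `tweet_json`, and the left join is one possible choice. On the real tables, `tweet_id` would first need the same type on both sides.

```python
import numpy as np
import pandas as pd

# Toy stand-in for twitter_archive: one original tweet, one reply, one retweet
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0],
    "text": ["original tweet", "a reply", "a retweet"],
})

# Toy stand-in for tweet_json
counts = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "retweet_count": [5, 1],
    "favorite_count": [9, 2],
})

# A post is an original tweet when both the reply and retweet identifiers are missing
is_original = archive["in_reply_to_status_id"].isna() & archive["retweeted_status_id"].isna()
originals = archive[is_original]

# A left join keeps every original tweet, even one without counts
merged = originals.merge(counts, on="tweet_id", how="left")
print(merged)
```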

### 3. Cleaning the data

In [25]:
# Always start by copying the dataframe
twitter_archive_copy = twitter_archive.copy()
image_predictions_copy = image_predictions.copy()
tweet_json_copy = tweet_json.copy()

<a id='readDAta'></a>
#### Missing data

<font color='red'>The `expanded_urls` column contains 2297 values instead of 2356.</font><br>
><ul>
    <li>Since this column is not essential to our analyses, we leave it as is. It will not be processed.</li>
</ul>

<a id='readDAta'></a>
####  3.1. Quality issues

##### Define
<font color='red'>1. The `tweet_id` column is of type int64 instead of object.</font><br>
Therefore:
><ul>
    <li>Convert the tweet_id column from int64 to object using the astype(str) method</li>
</ul>

##### Code

In [26]:
twitter_archive_copy.tweet_id = twitter_archive_copy.tweet_id.astype(str)

#### Test

In [27]:
print(twitter_archive_copy.dtypes.tweet_id)

object


##### Define
<font color='red'>2. The `in_reply_to_status_id` column is of type `float64` instead of `object`</font><br>
Therefore:
><ul>
    <li>Convert the in_reply_to_status_id column from float64 to object using the astype(str) method</li>
</ul>

##### Code

In [28]:
twitter_archive_copy.in_reply_to_status_id = twitter_archive_copy.in_reply_to_status_id.astype(str)

#### Test

In [29]:
print(twitter_archive_copy.dtypes.in_reply_to_status_id )

object
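One caveat with converting a `float64` identifier column to text: the column was already read as floating point, so long identifiers may have lost digits before the conversion, and they print in scientific notation. A possible alternative, sketched below on a made-up two-row CSV, is to read the identifier columns as strings from the start with the `dtype` argument of `read_csv`. The second identifier in the sample is a placeholder, not a real tweet.

```python
import io

import pandas as pd

# Made-up two-row CSV: the reply identifier is a placeholder value
csv_text = (
    "tweet_id,in_reply_to_status_id\n"
    "892420643555336193,\n"
    "835246439529840640,835245984028504066\n"
)

# Default parsing: the missing value forces the column to float64
as_float = pd.read_csv(io.StringIO(csv_text))
float_value = float(as_float["in_reply_to_status_id"].iloc[1])
print(as_float["in_reply_to_status_id"].dtype)
print(repr(float_value))  # scientific notation, no longer the exact identifier

# Reading the identifier columns as text keeps every digit and keeps missing values missing
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"tweet_id": str, "in_reply_to_status_id": str},
)
print(as_text["in_reply_to_status_id"].iloc[1])
print(as_text["in_reply_to_status_id"].isna().tolist())
```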


##### Define
<font color='red'>3. The `in_reply_to_user_id` column is of type `float64` instead of `object`</font><br>
Therefore:
><ul>
    <li>Convert the in_reply_to_user_id column from float64 to object using the astype(str) method</li>
</ul>

##### Code

In [30]:
twitter_archive_copy.in_reply_to_user_id = twitter_archive_copy.in_reply_to_user_id.astype(str)

#### Test

In [31]:
print(twitter_archive_copy.dtypes.in_reply_to_user_id)

object


##### Define
<font color='red'>4. The `timestamp` column is of type `object` instead of `datetime`</font><br>
Therefore:
><ul>
    <li>Convert the timestamp column from object to datetime using astype('datetime64[ns]')</li>
</ul>

##### Code

In [32]:
twitter_archive_copy.timestamp = twitter_archive_copy['timestamp'].astype('datetime64[ns]')

#### Test

In [33]:
print(twitter_archive_copy.dtypes.timestamp)

datetime64[ns]


##### Define
<font color='red'>5. The `retweeted_status_timestamp` column is of type `object` instead of `datetime`</font><br>
Therefore:
><ul>
    <li>Convert the retweeted_status_timestamp column from object to datetime using astype('datetime64[ns]')</li>
</ul>

##### Code

In [34]:
twitter_archive_copy.retweeted_status_timestamp = twitter_archive_copy['retweeted_status_timestamp'].astype('datetime64[ns]')

#### Test

In [35]:
print(twitter_archive_copy.dtypes.retweeted_status_timestamp)

datetime64[ns]
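The same conversion can also be written with `pd.to_datetime`, which parses the `+0000` offset explicitly. This is a sketch of an alternative, not what the cells above run: with `utc=True` the result is timezone-aware, and dropping the timezone afterwards gives naive timestamps like those displayed in this notebook. The sample series is a made-up excerpt with one missing value.

```python
import pandas as pd

# Made-up excerpt of a timestamp column, including one missing value
timestamps = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2017-08-01 00:17:27 +0000",
    None,
])

# Parse the strings as UTC; the missing value becomes NaT
parsed = pd.to_datetime(timestamps, utc=True)
print(parsed.dtype)

# Optionally drop the timezone to get naive timestamps
naive = parsed.dt.tz_localize(None)
print(naive)
```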


##### Define
<font color='red'>6. The `tweet_id` column of the `image_predictions` table is of type int64 instead of object.</font><br>
Therefore:
><ul>
    <li>Convert the tweet_id column from int64 to object using the astype(str) method</li>
</ul>

##### Code

In [36]:
image_predictions_copy.tweet_id = image_predictions_copy.tweet_id.astype(str)

#### Test

In [37]:
print(image_predictions_copy.dtypes.tweet_id)

object



In [39]:
note = 10
denominator_different_to_ten = twitter_archive_copy.loc[(twitter_archive_copy.rating_denominator != 10)]
denominator_different_to_ten

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35245984028504e+17,26259576.0,2017-02-24 21:54:03,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,960,0,,,,,
342,832088576586297345,8.320875475599974e+17,30582082.0,2017-02-16 04:45:50,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,NaT,,11,15,,,,,
433,820690176645140481,,,2017-01-15 17:52:40,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,NaT,https://twitter.com/dog_rates/status/820690176...,84,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,NaT,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4.196984e+09,2016-06-08 02:41:38,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1662,682962037429899265,,,2016-01-01 16:30:13,"<a href=""http://twitter.com/download/iphone"" r...",This is Darrel. He just robbed a 7/11 and is i...,,,NaT,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,
1663,682808988178739200,6.827884415375606e+17,4196983835.0,2016-01-01 06:22:03,"<a href=""http://twitter.com/download/iphone"" r...","I'm aware that I could've said 20/16, but here...",,,NaT,,20,16,,,,,
1779,677716515794329600,,,2015-12-18 05:06:23,"<a href=""http://twitter.com/download/iphone"" r...",IT'S PUPPERGEDDON. Total of 144/120 ...I think...,,,NaT,https://twitter.com/dog_rates/status/677716515...,144,120,,,,,
1843,675853064436391936,,,2015-12-13 01:41:41,"<a href=""http://twitter.com/download/iphone"" r...",Here we have an entire platoon of puppers. Tot...,,,NaT,https://twitter.com/dog_rates/status/675853064...,88,80,,,,,


In [40]:
# Extract every "numerator/denominator" pattern from the text of each tweet
tweets_with_several_notes = []
for row in denominator_different_to_ten.text:
    tweets_with_several_notes.append(re.findall(r"([0-9]+/[0-9]+)", row))
print(tweets_with_several_notes)
# Keep the tweets whose text contains at least two such patterns
tweets_with_several = denominator_different_to_ten[denominator_different_to_ten.text.str.count(r"[0-9]+/[0-9]+") >= 2]

[['960/00', '13/10'], ['11/15'], ['84/70'], ['24/7'], ['9/11', '14/10'], ['165/150'], ['9/11', '14/10'], ['204/170'], ['4/20', '13/10'], ['50/50', '11/10'], ['99/90'], ['80/80'], ['45/50'], ['60/50'], ['44/40'], ['4/20'], ['143/130'], ['121/110'], ['7/11', '10/10'], ['20/16'], ['144/120'], ['88/80'], ['1/2', '9/10']]




In [41]:
data = [s[-1] for s in tweets_with_several_notes]
print(data)

['13/10', '11/15', '84/70', '24/7', '14/10', '165/150', '14/10', '204/170', '13/10', '11/10', '99/90', '80/80', '45/50', '60/50', '44/40', '4/20', '143/130', '121/110', '10/10', '20/16', '144/120', '88/80', '9/10']


In [42]:
def convertNoteByTen(list_of_texts):
    '''
    Rescale each "numerator/denominator" rating to a denominator of 10:
    true_numerator/10 = numerator/denominator
    => true_numerator*denominator = numerator*10
    => true_numerator = (numerator*10)/denominator
    '''
    numerators_list = []
    for element in list_of_texts:
        # Split the rating into its two parts
        numerator, denominator = (float(part) for part in element.split('/'))
        if denominator != 10:
            numerator = (numerator*10)/denominator
        numerators_list.append(round(numerator))
    return numerators_list

In [43]:
list_of_rattings_numerators = convertNoteByTen(data)
list_of_rattings_numerators

[13,
 7,
 12,
 34,
 14,
 11,
 14,
 12,
 13,
 11,
 11,
 10,
 9,
 12,
 11,
 2,
 11,
 11,
 10,
 12,
 12,
 11,
 9]

In [44]:
# Replace the rating_numerator column of denominator_different_to_ten with the rescaled numerators

# 1. Find the name of the column by index
index_of_column_to_replace = denominator_different_to_ten.columns[10]
print(index_of_column_to_replace)

# 2. Drop that column
denominator_different_to_ten.drop(index_of_column_to_replace, axis = 1, inplace = True)

# 3. Replace the rating_denominator
denominator_different_to_ten.rating_denominator = 10

# 4. Put whatever series you want in its place
denominator_different_to_ten['rating_numerator'] = list_of_rattings_numerators

# 5. Verify the result
print(denominator_different_to_ten.info())
denominator_different_to_ten

rating_numerator
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 313 to 2335
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   tweet_id                    23 non-null     object        
 1   in_reply_to_status_id       23 non-null     object        
 2   in_reply_to_user_id         23 non-null     object        
 3   timestamp                   23 non-null     datetime64[ns]
 4   source                      23 non-null     object        
 5   text                        23 non-null     object        
 6   retweeted_status_id         1 non-null      float64       
 7   retweeted_status_user_id    1 non-null      float64       
 8   retweeted_status_timestamp  1 non-null      datetime64[ns]
 9   expanded_urls               19 non-null     object        
 10  rating_denominator          23 non-null     int64         
 11  name                        23 non-null

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_denominator,name,doggo,floofer,pupper,puppo,rating_numerator
313,835246439529840640,8.35245984028504e+17,26259576.0,2017-02-24 21:54:03,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,10,,,,,,13
342,832088576586297345,8.320875475599974e+17,30582082.0,2017-02-16 04:45:50,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,NaT,,10,,,,,,7
433,820690176645140481,,,2017-01-15 17:52:40,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,NaT,https://twitter.com/dog_rates/status/820690176...,10,,,,,,12
516,810984652412424192,,,2016-12-19 23:06:23,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,NaT,"https://www.gofundme.com/sams-smile,https://tw...",10,Sam,,,,,34
784,775096608509886464,,,2016-09-11 22:20:06,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4.196984e+09,2016-06-08 02:41:38,https://twitter.com/dog_rates/status/740373189...,10,,,,,,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1662,682962037429899265,,,2016-01-01 16:30:13,"<a href=""http://twitter.com/download/iphone"" r...",This is Darrel. He just robbed a 7/11 and is i...,,,NaT,https://twitter.com/dog_rates/status/682962037...,10,Darrel,,,,,10
1663,682808988178739200,6.827884415375606e+17,4196983835.0,2016-01-01 06:22:03,"<a href=""http://twitter.com/download/iphone"" r...","I'm aware that I could've said 20/16, but here...",,,NaT,,10,,,,,,12
1779,677716515794329600,,,2015-12-18 05:06:23,"<a href=""http://twitter.com/download/iphone"" r...",IT'S PUPPERGEDDON. Total of 144/120 ...I think...,,,NaT,https://twitter.com/dog_rates/status/677716515...,10,,,,,,12
1843,675853064436391936,,,2015-12-13 01:41:41,"<a href=""http://twitter.com/download/iphone"" r...",Here we have an entire platoon of puppers. Tot...,,,NaT,https://twitter.com/dog_rates/status/675853064...,10,,,,,,11


In [45]:
# Get the tweets whose text column contains a decimal rating such as "13.5/10".
notes_with_decimals = twitter_archive_copy.loc[twitter_archive_copy.text.str.contains(r"\d+\.\d+/")].copy()
notes_with_decimals

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
45,883482846933004288,,,2017-07-08 00:28:19,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,NaT,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,,,,
340,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4196984000.0,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,75,10,Logan,,,,
695,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,75,10,Logan,,,,
763,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,27,10,Sophie,,,pupper,
1689,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,5,10,,,,,
1712,680494726643068929,,,2015-12-25 21:06:00,"<a href=""http://twitter.com/download/iphone"" r...",Here we have uncovered an entire battalion of ...,,,NaT,https://twitter.com/dog_rates/status/680494726...,26,10,,,,,


In [46]:
notes_list=[]
for row in notes_with_decimals.text:
    note = re.findall(r"(\d+\.\d+/+\d+)", row)
    notes_list.append(note)
print(notes_list)

[['13.5/10'], ['9.75/10'], ['9.75/10'], ['11.27/10'], ['9.5/10'], ['11.26/10']]
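A minimal sketch of why the dot in a rating pattern is worth escaping. The sample texts below are hypothetical and are not taken from the archive; only the standard `re` module is used.

```python
import re

# Hypothetical sample texts (assumptions, not rows of the archive).
samples = [
    "Solid effort. 13.5/10 would pet again",
    "Scored 9 5/10 on the floof scale",  # no decimal point in this one
]

loose = re.compile(r"(\d+.\d+/+\d+)")    # '.' matches any character
strict = re.compile(r"(\d+\.\d+/+\d+)")  # '\.' matches only a literal dot

loose_hits = [loose.findall(s) for s in samples]
strict_hits = [strict.findall(s) for s in samples]

print(loose_hits)
print(strict_hits)
```

With the unescaped dot, the second text yields a match even though it contains no decimal number; the escaped version only returns true decimal ratings.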


In [47]:
decimals_notes_list = [s[-1] for s in notes_list]
print(decimals_notes_list)
list_of_rattings_numerators_from_decimals_numbers = convertNoteByTen(decimals_notes_list)
list_of_rattings_numerators_from_decimals_numbers

['13.5/10', '9.75/10', '9.75/10', '11.27/10', '9.5/10', '11.26/10']


[14, 10, 10, 11, 10, 11]
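The definition of `convertNoteByTen` is not shown in this part of the notebook. The sketch below is a hypothetical stand-in (the name `convert_note_to_ten_sketch` and the round-half-up rule are assumptions) that reproduces the output above: it rescales each "numerator/denominator" string to a denominator of 10 and rounds to the nearest integer.

```python
import math

def convert_note_to_ten_sketch(notes):
    """Hypothetical helper: turn strings like '13.5/10' into integer numerators out of 10."""
    numerators = []
    for note in notes:
        numerator, denominator = note.split("/")
        # Rescale to a denominator of 10, then round half up.
        scaled = float(numerator) * 10 / float(denominator)
        numerators.append(math.floor(scaled + 0.5))
    return numerators

decimal_notes = ['13.5/10', '9.75/10', '9.75/10', '11.27/10', '9.5/10', '11.26/10']
converted = convert_note_to_ten_sketch(decimal_notes)
print(converted)
```

Rounding half up is chosen here so that the result does not depend on Python's round-half-to-even behaviour; the notebook's own helper may use a different rule.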

In [48]:
# Replace the rating_numerator column of notes_with_decimals with list_of_rattings_numerators_from_decimals_numbers

# 1. Find the name of the column by index
index_of_column_to_replace = notes_with_decimals.columns[10]
print(index_of_column_to_replace)

# 2. Drop that column
notes_with_decimals.drop(index_of_column_to_replace, axis = 1, inplace = True)

# 3. Put whatever series you want in its place
notes_with_decimals['rating_numerator'] = list_of_rattings_numerators_from_decimals_numbers

#Verify the result
notes_with_decimals

rating_numerator


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_denominator,name,doggo,floofer,pupper,puppo,rating_numerator
45,883482846933004288,,,2017-07-08 00:28:19,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,NaT,https://twitter.com/dog_rates/status/883482846...,10,Bella,,,,,14
340,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4196984000.0,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,10,Logan,,,,,10
695,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,10,Logan,,,,,10
763,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,10,Sophie,,,pupper,,11
1689,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,10,,,,,,10
1712,680494726643068929,,,2015-12-25 21:06:00,"<a href=""http://twitter.com/download/iphone"" r...",Here we have uncovered an entire battalion of ...,,,NaT,https://twitter.com/dog_rates/status/680494726...,10,,,,,,11


<a id='readDAta'></a>
#### 3.2. Tidiness (ordering) issues

In [49]:
notes_with_decimals

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_denominator,name,doggo,floofer,pupper,puppo,rating_numerator
45,883482846933004288,,,2017-07-08 00:28:19,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,NaT,https://twitter.com/dog_rates/status/883482846...,10,Bella,,,,,14
340,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4196984000.0,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,10,Logan,,,,,10
695,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,10,Logan,,,,,10
763,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,10,Sophie,,,pupper,,11
1689,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,10,,,,,,10
1712,680494726643068929,,,2015-12-25 21:06:00,"<a href=""http://twitter.com/download/iphone"" r...",Here we have uncovered an entire battalion of ...,,,NaT,https://twitter.com/dog_rates/status/680494726...,10,,,,,,11


In [50]:
# Helper that returns the DataFrame with its columns in the given order
def resetColumnsOrder(dataframe, ordered_columns_list):
    return dataframe[ordered_columns_list]
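Selecting columns by a list raises a generic `KeyError` when a name is missing. As a sketch only, a hypothetical checked variant (the name `reset_columns_order_checked` and the toy DataFrame are assumptions) could report which columns are absent before reordering:

```python
import pandas as pd

def reset_columns_order_checked(dataframe, ordered_columns_list):
    """Hypothetical variant of resetColumnsOrder that names the missing columns."""
    missing = [col for col in ordered_columns_list if col not in dataframe.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return dataframe[ordered_columns_list]

# Toy data: rating_numerator sits at the end, as after the drop-and-reassign step.
toy = pd.DataFrame({"rating_denominator": [10], "name": ["Bella"], "rating_numerator": [14]})
reordered = reset_columns_order_checked(
    toy, ["rating_numerator", "rating_denominator", "name"]
)
print(list(reordered.columns))
```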

In [51]:
ordered_twitter_archive_copy_columns_list = list(twitter_archive_copy.columns.values)
ordered_twitter_archive_copy_columns_list

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

In [52]:
# Reset the order of the denominator_different_to_ten DataFrame
denominator_different_to_ten = resetColumnsOrder(denominator_different_to_ten, ordered_twitter_archive_copy_columns_list)
denominator_different_to_ten

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35245984028504e+17,26259576.0,2017-02-24 21:54:03,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,13,10,,,,,
342,832088576586297345,8.320875475599974e+17,30582082.0,2017-02-16 04:45:50,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,NaT,,7,10,,,,,
433,820690176645140481,,,2017-01-15 17:52:40,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,NaT,https://twitter.com/dog_rates/status/820690176...,12,10,,,,,
516,810984652412424192,,,2016-12-19 23:06:23,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,NaT,"https://www.gofundme.com/sams-smile,https://tw...",34,10,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4.196984e+09,2016-06-08 02:41:38,https://twitter.com/dog_rates/status/740373189...,14,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1662,682962037429899265,,,2016-01-01 16:30:13,"<a href=""http://twitter.com/download/iphone"" r...",This is Darrel. He just robbed a 7/11 and is i...,,,NaT,https://twitter.com/dog_rates/status/682962037...,10,10,Darrel,,,,
1663,682808988178739200,6.827884415375606e+17,4196983835.0,2016-01-01 06:22:03,"<a href=""http://twitter.com/download/iphone"" r...","I'm aware that I could've said 20/16, but here...",,,NaT,,12,10,,,,,
1779,677716515794329600,,,2015-12-18 05:06:23,"<a href=""http://twitter.com/download/iphone"" r...",IT'S PUPPERGEDDON. Total of 144/120 ...I think...,,,NaT,https://twitter.com/dog_rates/status/677716515...,12,10,,,,,
1843,675853064436391936,,,2015-12-13 01:41:41,"<a href=""http://twitter.com/download/iphone"" r...",Here we have an entire platoon of puppers. Tot...,,,NaT,https://twitter.com/dog_rates/status/675853064...,11,10,,,,,


In [53]:
# Reset the order of the notes_with_decimals DataFrame
notes_with_decimals = resetColumnsOrder(notes_with_decimals, ordered_twitter_archive_copy_columns_list)
notes_with_decimals

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
45,883482846933004288,,,2017-07-08 00:28:19,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,NaT,https://twitter.com/dog_rates/status/883482846...,14,10,Bella,,,,
340,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4196984000.0,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,
695,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,
763,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,11,10,Sophie,,,pupper,
1689,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,10,10,,,,,
1712,680494726643068929,,,2015-12-25 21:06:00,"<a href=""http://twitter.com/download/iphone"" r...",Here we have uncovered an entire battalion of ...,,,NaT,https://twitter.com/dog_rates/status/680494726...,11,10,,,,,


In [54]:
# Concatenate the DataFrames which contain the corrected ratings
dataframes = [denominator_different_to_ten, notes_with_decimals]
merged_dataframe = pd.concat(dataframes)

In [55]:
# test the merge operation
print('denominator_different_to_ten : {0}\nnotes_with_decimals : {1}\nmerged_dataframe : {2}\nTotal of rows merge = {0}+{1} = {2}'.format(denominator_different_to_ten.shape, notes_with_decimals.shape, merged_dataframe.shape))

denominator_different_to_ten : (23, 17)
notes_with_decimals : (6, 17)
merged_dataframe : (29, 17)
Total of rows merge = (23, 17)+(6, 17) = (29, 17)


In [56]:
# Delete the rows which contain the bad ratings (they are re-added later with corrected values)
twitter_archive_copy = twitter_archive_copy[~twitter_archive_copy['tweet_id'].isin(merged_dataframe['tweet_id'])]

In [57]:
# Test the delete rows operation
print('twitter_archive after drop rows : {0}\nmerged_dataframe : {1}\nTotal of rows of twitter_archive_copy before drop rows = {0}+{1} = {2}'.format(twitter_archive_copy.shape, merged_dataframe.shape, twitter_archive.shape))

twitter_archive after drop rows : (2327, 17)
merged_dataframe : (29, 17)
Total of rows of twitter_archive_copy before drop rows = (2327, 17)+(29, 17) = (2356, 17)


In [58]:
twitter_archive_copy.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [59]:
merged_dataframe

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35245984028504e+17,26259576.0,2017-02-24 21:54:03,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,13,10,,,,,
342,832088576586297345,8.320875475599974e+17,30582082.0,2017-02-16 04:45:50,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,NaT,,7,10,,,,,
433,820690176645140481,,,2017-01-15 17:52:40,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,NaT,https://twitter.com/dog_rates/status/820690176...,12,10,,,,,
516,810984652412424192,,,2016-12-19 23:06:23,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,NaT,"https://www.gofundme.com/sams-smile,https://tw...",34,10,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4.196984e+09,2016-06-08 02:41:38,https://twitter.com/dog_rates/status/740373189...,14,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4.196984e+09,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,
695,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,
763,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,11,10,Sophie,,,pupper,
1689,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,10,10,,,,,


In [60]:
# Add the rows which contain the corrected ratings (concatenate twitter_archive_copy with merged_dataframe)
dataframes = [twitter_archive_copy, merged_dataframe]
twitter_archive_copy = pd.concat(dataframes)
print(twitter_archive_copy.shape)

(2356, 17)
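Deleting the affected rows and concatenating the corrected ones moves them to the end of the table, as the later outputs show. An alternative, sketched here on toy data (the frames `archive` and `corrected` and their values are assumptions), overwrites the numerators in place by matching on `tweet_id`, so row order and index are preserved:

```python
import pandas as pd

# Toy stand-ins for twitter_archive_copy and the corrected rows.
archive = pd.DataFrame(
    {
        "tweet_id": [101, 102, 103, 104],
        "rating_numerator": [13, 5, 75, 12],
        "rating_denominator": [10, 10, 10, 10],
    }
)
corrected = pd.DataFrame({"tweet_id": [102, 103], "rating_numerator": [14, 10]})

# Build a tweet_id -> corrected numerator lookup.
corrections = corrected.set_index("tweet_id")["rating_numerator"]

# Overwrite only the rows whose tweet_id has a correction.
mask = archive["tweet_id"].isin(corrections.index)
archive.loc[mask, "rating_numerator"] = archive.loc[mask, "tweet_id"].map(corrections)

print(archive)
```

The shape check used in the notebook still applies afterwards, since no row is removed or added.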


In [61]:
# Get the list of unique values of the columns which were cleaned
getUniqueValues(twitter_archive_copy, ['rating_numerator', 'rating_denominator'])

[{'rating_numerator': [13,
   12,
   14,
   17,
   11,
   10,
   420,
   666,
   6,
   15,
   182,
   0,
   7,
   9,
   8,
   1,
   5,
   3,
   4,
   1776,
   2,
   34]},
 {'rating_denominator': [10]}]

In [62]:
# Get the tweets for which the rating_numerator is greater than 17 or less than 7
tweets_less_note_7_and_sup_note_17 = twitter_archive_copy.query('7 > rating_numerator  or  rating_numerator > 17')
tweets_less_note_7_and_sup_note_17

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
188,855862651834028034,8.558615844633518e+17,194351775.0,2017-04-22 19:15:32,"<a href=""http://twitter.com/download/iphone"" r...",@dhmontgomery We also gave snoop dogg a 420/10...,,,NaT,,420,10,,,,,
189,855860136149123072,8.558585356070011e+17,13615722.0,2017-04-22 19:05:32,"<a href=""http://twitter.com/download/iphone"" r...",@s8n You tried very hard to portray this good ...,,,NaT,,666,10,,,,,
229,848212111729840128,,,2017-04-01 16:35:01,"<a href=""http://twitter.com/download/iphone"" r...",This is Jerry. He's doing a distinguished tong...,,,NaT,https://twitter.com/dog_rates/status/848212111...,6,10,Jerry,,,,
290,838150277551247360,8.381454986911949e+17,21955058.0,2017-03-04 22:12:52,"<a href=""http://twitter.com/download/iphone"" r...",@markhoppus 182/10,,,NaT,,182,10,,,,,
315,835152434251116546,,,2017-02-24 15:40:31,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,,,NaT,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,666051853826850816,,,2015-11-16 00:35:11,"<a href=""http://twitter.com/download/iphone"" r...",This is an odd dog. Hard on the outside but lo...,,,NaT,https://twitter.com/dog_rates/status/666051853...,2,10,an,,,,
2351,666049248165822465,,,2015-11-16 00:24:50,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,NaT,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,NaT,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
516,810984652412424192,,,2016-12-19 23:06:23,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,NaT,"https://www.gofundme.com/sams-smile,https://tw...",34,10,Sam,,,,


In [63]:
for text in tweets_less_note_7_and_sup_note_17.text:
    print(text)
    print('________________________')

@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
________________________
@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
________________________
This is Jerry. He's doing a distinguished tongue slip. Slightly patronizing tbh. You think you're better than us, Jerry? 6/10 hold me back https://t.co/DkOBbwulw1
________________________
@markhoppus 182/10
________________________
When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag
________________________
RT @dog_rates: Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/…
________________________
Who keeps sending in pictures without dogs in them? This needs to stop. 5/10 for the mediocre road https://t.co/ELqelxWMrC
________________________
This is Wesley. He's clearly trespassing. Se

>1. tweet_json should be part of the `twitter_archive` table

##### Define
<ul>
    <li>Merge the two tables (twitter_archive_copy and tweet_json_copy) on the tweet_id column</li>
</ul>

##### Code

In [64]:
# Merge the two DataFrames on tweet_id (the result is displayed here, not assigned)
twitter_archive_copy.merge(tweet_json_copy, on='tweet_id')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,6961,33648
1,892177421306343426,,,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,5265,29193
2,891815181378084864,,,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,3463,21954
3,891689557279858688,,,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,7182,36738
4,891327558926688256,,,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,7707,35137
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2320,832215909146226688,,,2017-02-16 13:11:49,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: This is Logan, the Chow who liv...",7.867091e+17,4.196984e+09,2016-10-13 23:23:56,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,,5689,0
2321,786709082849828864,,,2016-10-13 23:23:56,"<a href=""http://twitter.com/download/iphone"" r...","This is Logan, the Chow who lived. He solemnly...",,,NaT,https://twitter.com/dog_rates/status/786709082...,10,10,Logan,,,,,5689,17293
2322,778027034220126208,,,2016-09-20 00:24:34,"<a href=""http://twitter.com/download/iphone"" r...",This is Sophie. She's a Jubilant Bush Pupper. ...,,,NaT,https://twitter.com/dog_rates/status/778027034...,11,10,Sophie,,,pupper,,1485,6169
2323,681340665377193984,6.813394486558024e+17,4196983835.0,2015-12-28 05:07:27,"<a href=""http://twitter.com/download/iphone"" r...",I've been told there's a slight possibility he...,,,NaT,,10,10,,,,,,244,1515
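The output above ends at index 2323 while the archive has 2356 rows, which suggests that some tweets had no match in `tweet_json_copy` and were dropped by the default inner join. A hedged sketch on toy data (the frames and their values are assumptions) of how `how`, `validate` and `indicator` can make this visible:

```python
import pandas as pd

# Toy stand-ins for twitter_archive_copy and tweet_json_copy.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 10]})
counts = pd.DataFrame(
    {"tweet_id": [1, 2], "retweet_count": [6961, 5265], "favorite_count": [33648, 29193]}
)

merged = archive.merge(
    counts,
    on="tweet_id",
    how="left",             # keep every archive row, even without counts
    validate="one_to_one",  # raise if tweet_id is duplicated on either side
    indicator=True,         # add a _merge column describing each row's origin
)

# Tweets present in the archive but absent from the counts table.
unmatched = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched)
```

Whether to keep or drop the unmatched tweets is a separate decision; the point of the sketch is only to make the loss explicit instead of silent.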


#### Test

<ul>
    <li>Select all nondescriptive and misspelled column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) and replace them with full words (ApplicationProcedure, AboutCompany, RequiredQualifications, JobRequirement)</li>
</ul>

In [65]:
df_clean = df_clean.rename(columns={'ApplicationP': 'ApplicationProcedure',
                                    'AboutC': 'AboutCompany',
                                    'RequiredQual': 'RequiredQualifications',
                                    'JobRequirment': 'JobRequirement'})

NameError: name 'df_clean' is not defined

#### Test

In [None]:
#twitter_archive_copy.rating_numerator = notes_with_decimals[notes_with_decimals['rating_numerator'].isin(twitter_archive_copy['rating_numerator'])]['rating_numerator'].values

In [None]:
print(twitter_archive_copy.query('tweet_id == "832215909146226688"'))
# A DataFrame is not callable; look up the row by its index label instead
print(twitter_archive_copy.loc[45])

In [None]:
for ident in notes_with_decimals.tweet_id:
    # Check membership in the archive, not in the DataFrame being iterated
    if ident in twitter_archive_copy.tweet_id.tolist():
        print(ident)
        #twitter_archive_copy.rating_numerator[ident] = notes_with_decimals.rating_numerator[ident]
        #print(twitter_archive_copy.rating_numerator[ident])
        # '@' lets query() read the local Python variable
        tweet_to_update = twitter_archive_copy.query('tweet_id == @ident')


#### Code

In [None]:
# Tweets whose text contains more than one "number/number" pattern
tweets_with_several_double = twitter_archive_copy[twitter_archive_copy.text.str.count(r"\d+/\d+") > 1]

In [None]:
# Get list of all available values for StartDate column
values_to_replace = df_clean.StartDate.value_counts().keys()
list_of_values_to_replace = values_to_replace.tolist()
new_value = 'ASAP'
asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
'"Immediate employment, after passing the interview."',
'ASAP preferred', 'Employment contract signature date',
'Immediate employment opportunity', 'Immidiately', 'ASA',
'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
'Immediately upon agreement', '20 November 2014 or ASAP',
'immediately', 'Immediatelly',
'"Immediately upon selection or no later than November 15, 2009."',
'Immediate job opening', 'Immediate hiring', 'Upon selection',
'As soon as practical', 'Immadiate', 'As soon as posible',
'Immediately with 2 months probation period',
'12 November 2012 or ASAP', 'Immediate employment after passing the interview',
'Immediately/ upon agreement', '01 September 2014 or ASAP',
'Immediately or as per agreement', 'as soon as possible',
'As soon as Possible', 'in the nearest future', 'immediate',
'01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
'Immediate or earliest possible', 'Immediate hire',
'Earliest  possible', 'ASAP with 3 months probation period.',
'Immediate employment opportunity.', 'Immediate employment.',
'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

# Convert values
for phrase in asap_list:
    df_clean.replace(phrase, new_value, inplace=True)

In [None]:
df_clean.StartDate.value_counts()

In [None]:
# Verify that every replacement was applied, using assert statements
# If no error is raised, the cleaning worked as expected.
for phrase in asap_list:
    assert phrase not in df_clean.StartDate.values

>We are now done with the data cleaning.
>Let us now try a small analysis with the cleaned data:
>
>Compute the percentage of offers that have a date of d

In [None]:
# Get sum of asap value
asap_counts = df_clean.StartDate.value_counts()['ASAP']
asap_counts

In [None]:
# Get non empty StartDate
non_empty_counts = df_clean.StartDate.count()
non_empty_counts

In [None]:
# Calculate asap %
asap_pourcentage = (asap_counts / non_empty_counts)*100
asap_pourcentage

In [None]:
# Create pie chart
df_clean.StartDate.value_counts().plot(kind='pie');