# Udacity Data Wrangling Project 

## Objective

In this project we are going to extract data from different sources related with the Twitter account `@dog_rates`.

Basicly we will work with four data sources:

* **'twitter-archive-enhanced.csv':**  A csv file with 2356 tweets of this account. Each one with a picture of a dog. They use to mark this dogs, usually with marks greater than 10 over 10: 11/10, 13/10, etc. This file is provided by Udacity for making the project.

* **'image-predictions.tsv':** A tsv file with the results obtained of applying a predictive method over the pictures of the tweets. This file was obtained in a project in another nanodegreee and it is provided by Udacity also. Every image in the WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

* **Additional information obtained with tweetpy:** Once obtained the Twitter credentials, we are going to use the tweetpy API to get more additional data. To do this we will connect to the Twitter platform and using tweetpy we will download the tweet status for each tweet in `twitter-archive-enhanced.csv`. Then we will save these results in a file callde `twitter_archive.json` using the json library. Finally, we will read this file and we will extract some more data to another dataframe using json again.

* **Information abour the replies of each tweet:** Finally we would like to extract the data corresponding to the replies of each tweet. We have tried some different methods:

    * In some places it is recommended to use tweepy to make a query of all the tweets referenced to @rate_dogs, and search which of them are a reply to the status of the tweet. Translated to code, something like this:
    
            consumer_key = 'XXXXXX'
            consumer_secret = 'XXXXXX'
            access_token = 'XXXXXX'
            access_secret = 'XXXXXX'

            auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
            auth.set_access_token(access_token, access_secret)

            twapi = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    
            replies=[]

            for tweet in tweepy.Cursor(twapi.search,q='to:'+name, since_id=892420643555336193, result_type='recent',timeout=999999).items(1000)
                if hasattr(tweet, 'in_reply_to_status_id_str'):
                    if (tweet.in_reply_to_status_id_str==tweet_id):
                        replies.append(tweet)
            
      but it has a lot of limitations and I didn't like it too much.

    * In other places it is recommended to use the urllib3 library to request pages. Then, you can use BeautifulSoup to interpret the result and scrapp the information that you need:
    
             http = urllib3.PoolManager()
             url = "https://twitter.com/dog_rates/status/892420643555336193"
             r = http.request('GET', url)
             soup = BeautifulSoup(r.data)
             tweets = soup.find_all('li','js-stream-item')
             for tweet in tweets:

             full_name = tweet.find("span", "FullNameGroup").find("strong", "fullname").contents[0]  
        
      But, in this case, you need to make scroll down on the page to see all the replies. Even so, when there are too many replies, the page cut the list and ask you in a link if you want to see more. You had to do this as many times as you need until you reach the end of the list. Apart from that, sometimes there are replies to the replies, and the page has another link to select to see them. I mean that with a single request you can't see all the replies if these are a lot.
      Maybe yo can do that using additional requests with POST or some other commands and sending the correct instruction to click in all the necessary links. But I felt like it was too much complicated.
       
    * Finally, I tried another method to do scrapping. I used the `selenium` library. It permits you to use a local browser to open the pages. You can navigate using the program on these pages and select and click any element of the page. Once you have deployed completely the page you can get it to a beautifulsoup object and interpret it. And using a local browser can be viewed as a disadvantage, but I felt more comfortable with this method and it is what I have used.




### Imports

In [1]:
#basic data libraries
import pandas as pd
import numpy as np
#to interact with the local system
import os
import sys
#to work with regular expressions
import re
#imports to user timers and make conversions of time formats
from timeit import default_timer as timer
from datetime import datetime
import time
#to make logs and track those processes that take a long time
import logging
#to get and interpret information of the web
import requests
import tweepy
from tweepy import OAuthHandler
import json
from bs4 import BeautifulSoup
import urllib.parse
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException



### Gathering The Data

* The first step will be to define a function to connect to Twitter using the API `tweepy`. We will use this function in other cells below. 

In [11]:
def connect_twitter():
    '''
    It connects to Twitter API.
    
    Returns:
        twapi: tweepy.api object to interact with the page.
    '''
    
    #It reads the keys to connect to Twitter API from a local file.
    #These keys are hidden to comply with Twitter's API terms and conditions
    with open('API keys.txt', mode = 'r') as file:
        keys = file.readlines()
        
    keys = [x.strip() for x in keys] 
    
    consumer_key = keys[0].split(":")[1]
    consumer_secret = keys[1].split(":")[1]
    access_token = keys[2].split(":")[1]
    access_secret = keys[3].split(":")[1]
    
    #It authenticates in tweepy with the previous credentials.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    twapi = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    
    return twapi


* Create the folder where to save the necesary files.

In [2]:
#It creates a folder called resources if it does not exists
folder_name = 'resources'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)


* Load data from `twitter-archive-enhanced.csv` file supplied by Udacity. This dataframe has the following columns:

    - **tweet_id:** The integer representation of the unique identifier for this Tweet. 
    - **in_reply_to_status_id:** If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.
    - **in_reply_to_user_id:**  If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
    - **timestamp:** date and time of the tweet.
    - **source:** Utility used to post the Tweet, as an HTML-formatted string. Tweets from the Twitter website have a source value of web.
    - **text:** The actual UTF-8 text of the status update. 
    - **retweeted_status_id:** If the represented Tweet is a retweet, this field will contain the integer representation of the original Tweet’s ID. If it is a retweet of a retweet it containg the original message id.
    - **retweeted_status_user_id:**  If the represented Tweet is a retweet, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
    - **retweeted_status_timestamp:** If the represented Tweet is a retweet, the timestampo of the original tweet.
    - **expanded_urls:** url of the tweet.
    - **rating_numerator:** numerator of the rating assigned according to the text of the tweet.
    - **rating_denominator:** denominator of the rating assigned according to the text of the tweet.
    - **name:** name of the dog according to the text of the tweet.
    - **doggo:** type of the dog acording to the text and to the clasification used in the page.
    - **floofer:** type of the dog acording to the text and to the clasification used in the page.
    - **pupper:** type of the dog acording to the text and to the clasification used in the page.
    - **puppo:** type of the dog acording to the text and to the clasification used in the page. 
    
    
![alt text](dogtionary-combined.png)


In [64]:
#Load the file twitter-archive-enhanced.csv into a dataframe
df_twitter_archive_enhanced = pd.read_csv(os.path.join(folder_name, 'twitter-archive-enhanced.csv'))

* Load data from the `image-predictions.tsv` file. This file was provided by Udacity in a especified url.

In [3]:
#We donwload the image-predictions.tsv file from the expecified url.
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
        file.write(response.content)

* We crate a new dataframe called `df_image_predictions`. This dataframe has the following columns:

    - **tweet_id:** tweet_id is the last part of the tweet URL after "status/": `https://twitter.com/dog_rates/status/889531135344209921`
    
    - **jpg_url:** url of the image of the tweet. It can be downloaded.
    - **img_num:** the image with the most confident prediction.
    - **p1:** is the algorithm's #1 prediction for the image in the tweet.
    - **p1_conf:** is how confident the algorithm is in its #1 prediction.
    - **p1_dog:** is whether or not the #1 prediction is a breed of dog.
    - **p2:** is the algorithm's second most likely prediction.
    - **p2_conf:** is how confident the algorithm is in its #2 prediction.
    - **p2_dog:** is whether or not the #2 prediction is a breed of dog.
    - **p3:** is the algorithm's third most likely prediction.
    - **p3_conf:** is how confident the algorithm is in its #3 prediction.
    - **p3_dog:** is whether or not the #3 prediction is a breed of dog.

In [63]:
#Load the file image-predictions.tsv into the dataframe df_image_predictions
df_image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')

* Load additional data from the pages of the tweets with the API tweepy. At the same time, we can see which pages still exist and which are not available in this moment. First, we download the content of the page of each tweet and we store it in a file calle `twitter_archive.json`, in json format.

In [7]:
=================================================
REMOVE THIS TO EXECUTE IT. IT CAN TAKES MORE THAN AN HOUR.
=================================================
#we connect to the Twitter API using tweepy 
twapi = connect_twitter()

start = timer()
#It configures the file 'tweepy_api.log' as a log to track the evolution
logging.basicConfig(filename='tweepy_api.log',level=logging.DEBUG)

#It initializes a list of Id's with all the tweets.
list_ids = df_twitter_archive_enhanced.tweet_id

total_count = 0
error_count = 0

#It initializes the log file
open('tweepy_api.log', 'w').close()

#It initializes the files used to save the results.
#We have created two files: one estructured in lines and other indented for a more
#friendly check.
open(os.path.join(folder_name, 'twitter_archive_indent.json'), 'w').close()
open(os.path.join(folder_name, 'twitter_archive.json'), 'w').close()

for tweet_id in list_ids:
    total_count += 1
    logging.debug('%s: Trying tweet for ID %s', total_count, tweet_id)
    try:
        #download the content of a tweet for a tweet_id given
        tweet = twapi.get_status(tweet_id, tweet_mode='extended')
        #store the content of the tweet using json in the file tweet_json_indent.txt, indent=2 spaces.
        with open(os.path.join(folder_name, 'twitter_archive_indent.json'), 'a', encoding='utf8', newline='\n') as out_file:
            json.dump(tweet._json, out_file, indent=2, ensure_ascii=False)
            out_file.write('\n')
        #store the content of the tweet using json in the file tweet_json.txt, in a sigle line.
        with open(os.path.join(folder_name, 'twitter_archive.json'), 'a', encoding='utf8', newline='\n') as out_file:
            json.dump(tweet._json, out_file, ensure_ascii=False)
            out_file.write('\n')
            
        #separate each tweet in the log file.    
        logging.debug('============================================================================')
        logging.debug('============================================================================')
    except tweepy.TweepError as te:
        #if we cannot download the tweet, we reflect this in the log and we increment the error count.
        logging.warning('%s: FAILED to get tweet ID %s: %s', total_count, tweet_id, str(te))
        error_count += 1
                    
    end = timer()
    #separate each tweet in the log file. 
    logging.debug('TOTAL: %s: TIME %s%s', total_count, end-start,'===========================================')
    logging.debug('TOTAL: %s. ERRORS: %s%s', total_count, error_count,'===========================================')
    logging.debug('============================================================================')
    logging.debug('============================================================================')
        


* Total count of tweets in this new file and number of errors. The errors are pages that existed when Udacity extracted the file `twitter-archive-enhanced.csv`, but now are unavailable.

In [8]:
total_count, error_count

(2356, 23)

* Use the data stored in the file `twitter_archive.json` in the previous step to create a new dataframe called `df_tweepy_extractions`. This dataframe will have the following columns:

    - **tweet_id:** The integer representation of the unique identifier for this Tweet.
    - **entities_name:** Users who are labelled under the picture of the tweet.
    - **entities_screen_name:** Screen name of the users who are labelled under the picture of the tweet.
    - **entities_type:** The type of the entity. In this case is always 'user'
    - **entities_user_id:** ID of the users who are labelled under the picture of the tweet.
    - **favorite_count:**  Indicates approximately how many times this Tweet has been liked by Twitter users. 
    - **favorites_count_retweet:** This field only surfaces when the Tweet is a retweet. Indicates approximately how many times the original Tweet has been liked by Twitter users. 
    - **mentions_name:** Display name of the referenced user in the text of the tweet.
    - **mentions_screen_name:** Screen name of the referenced user in the text of the tweet.
    - **mentions_user_id:** ID of the mentioned user, as an integer in the text of the tweet.
    - **quoted_status_id:** This field only surfaces when the Tweet is a quote Tweet. This field contains the integer value Tweet ID of the quoted Tweet. 
    - **quoted_user_id:** ID of the user quoted.
    - **quoted_status_id_rwetweet:** This field only surfaces when the Tweet is a retweet and the original Tweet is a is a quote Tweet. This field contains the integer value Tweet ID of the quoted Tweet.
    - **retweet_count:** Number of times this Tweet has been retweeted.
    - **retweet_count_retweet:** This field only surfaces when the Tweet is a retweet. Indicates approximately how many times the original Tweet has been retweeted.
    

In [47]:
#we first read the file and load the lines in a list called content
with open(os.path.join(folder_name, 'twitter_archive.json'), 'r', encoding='utf8') as input_file:
    content = input_file.readlines()
content = [x.strip() for x in content] 

#initialize the result dataframe df_tweepy_extractions
df_tweepy_extractions = pd.DataFrame()
tweet_status = {}

#read recursively each line in the list content
for line in content:
    #initialize the outcomes
    entities_name = ''
    entities_screen_name  = ''
    entities_type = ''
    entities_user_id = ''
    mentions_user_id = ''
    mentions_name = ''
    mentions_screen_name = ''
    favorites_count_retweet = ''
    retweet_count_retweet = ''
    quoted_status_id_rwetweet = ''
    quoted_status_id_str = ''
    
    #read each string with json.loads to interpretarte it
    tweet_status = json.loads(line)
    try:
        #if the object media exists innside entities.
        if 'media' in tweet_status['entities']:
            #In this case we are going to see the user or users tagged below the picture
            if 'all' in tweet_status['entities']['media'][0]['features']:
                #a list with the names of the users tagged
                entities_name = [name['name'] for name in tweet_status['entities']['media'][0]['features']['all']['tags']]
                #the screen names eje:@bla_bla_bla
                entities_screen_name = [screen_name['screen_name'] for screen_name in tweet_status['entities']['media'][0]['features']['all']['tags']]
                #types: user
                entities_type = [type_['type'] for type_ in tweet_status['entities']['media'][0]['features']['all']['tags']]
                #a list with the users id
                entities_user_id = [user_id['user_id'] for user_id in tweet_status['entities']['media'][0]['features']['all']['tags']]
        #if the object user_mentions exists innside entities. We can search the users named inside te text part.
        if 'user_mentions' in tweet_status['entities']:
            if len(tweet_status['entities']['user_mentions']) > 0:
                #the id of the user mentioned
                mentions_user_id = [user_id['id_str'] for user_id in tweet_status['entities']['user_mentions']]
                #the name of the user mentioned
                mentions_name = [name['name'] for name in tweet_status['entities']['user_mentions']]
                #the screen name of the user mentioned
                mentions_screen_name = [screen_name['name'] for screen_name in tweet_status['entities']['user_mentions']]
        #if the tweet is a retweet
        if 'retweeted_status' in tweet_status:
            if 'favorite_count' in tweet_status['retweeted_status']:
                #number of favorites in the original tweet.
                favorites_count_retweet = tweet_status['retweeted_status']['favorite_count']
                #number of retweets in the original tweet.
                retweet_count_retweet = tweet_status['retweeted_status']['retweet_count']
            if 'quoted_status_id_str' in tweet_status['retweeted_status']:
                #if the tweet is a retweet of a previously quoted tweet. The tweet id of the original quoted tweet.
                quoted_status_id_rwetweet = tweet_status['retweeted_status']['quoted_status_id_str']
        #if the tweet is a quoted ot other tweet.
        if 'quoted_status_id_str' in tweet_status:
            #the id of the quoted tweet.
            quoted_status_id_str = tweet_status['quoted_status_id_str']
            if 'quoted_status' in tweet_status:
                #id of the user quoted.
                quoted_user_id_str = tweet_status['quoted_status']['user']['id_str']
    
    except Exception as e:
        #register in the log any exception that it can occurs.
        logging.warning(e)
    #save the results in the dataframe df_tweepy_extractions
    df_tweepy_extractions = df_tweepy_extractions.append({'tweet_id': tweet_status['id_str'],
                                                          'retweet_count': tweet_status['retweet_count'],
                                                          'favorite_count': tweet_status['favorite_count'],
                                                          'favorites_count_retweet': favorites_count_retweet,
                                                          'retweet_count_retweet': retweet_count_retweet,
                                                          'entities_name': entities_name,
                                                          'entities_screen_name': entities_screen_name,
                                                          'entities_type': entities_type,
                                                          'entities_user_id': entities_user_id,
                                                          'mentions_user_id': mentions_user_id,
                                                          'mentions_name': mentions_name,
                                                          'mentions_screen_name': mentions_screen_name,
                                                          'quoted_status_id': quoted_status_id_str,
                                                          'quoted_user_id': quoted_user_id_str,
                                                          'quoted_status_id_rwetweet': quoted_status_id_rwetweet
                                                          },ignore_index=True)


#### Scrapping Replies Using Selenium

* Finally we are going to sacrap more information about each tweet using the library selenium. We are interested into obtain information about all the replies for each tweet. We will get a final dataframe called `df_scrapped_replies` with the following columns:

    - **conversation:** Id of the replied tweet.
    - **favs:** Number of favorites for this replying tweet.
    - **full_name:** name of the user who has replied.
    - **image:** If there is an image in the reply, it especifies the url.
    - **language:** When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text.
    - **references:** Other users ID that are referenced in the text of the reply, if they exist.
    - **replies:** Number of replies to this reply.
    - **reply_id:** tweet ID for this reply.
    - **retweets:** Number of retweets of this reply.
    - **text:** Text include in the reply.
    - **timestamp:** Date_time of the reply.
    - **user_id:** Id of the user who has replied.
    - **user_name:** Name of the user who has replied (@XXXXX).


* This fuction download a status page with all its replies into a driver object. We have user the Firefox driver.

In [48]:
def download_page(driver, user_name, conversation_id):

    '''
    This fuction download a status page with all its replies into a driver object. We have user the Firefox driver.
    
    Args:
        driver: selenium.webdriver object. Used to get the page, to move on it and make actions.
        (str) user_name: user name of the twitter profile. In this project: dog_rates.
        (str) conversation_id: status or tweet id for which we want to extract the data.
    
    Return:
    
        driver: selenium.webdriver object. The same object, but with all the replies to the tweet opened
    
    '''
    
    #Initialize a new file for the log.
    logging.basicConfig(filename='scrapping_replies.log',level=logging.DEBUG)
    #url of the page that we want to download
    url = "https://twitter.com/" + user_name + "/status/" + conversation_id
    #time to wait after each scroll
    SCROLL_PAUSE_TIME = 2
    # tells WebDriver to poll the DOM for a certain amount of time when trying 
    #to find any element (or elements) not immediately available.
    driver.implicitly_wait(10)
    
    driver.get(url)
    #we use the length of the page to know if we have downloaded the complete page.
    last_length = len(driver.page_source)
    
    count = 0
    while True:
        #scroolls down to the end of the page
        driver.find_element_by_tag_name("body").send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
        #the page is actualized with more replies. We get the new length
        new_length = len(driver.page_source)
        count += 1
        if count == 6:
            count = 0
        #Even when we have reached the end we wait 4*two weconds, just in case the page is not complete downloaded
        if (new_length == last_length) & (count > 4):
            count = 0
            #When the page is completely downloaded and there are not more replies to show, 
            #we search for the le link 'show nore replies' and we click it.
            try:             
                button_more = driver.find_element_by_css_selector('.ThreadedConversation-showMoreThreadsButton.u-textUserColor')
                button_more.click()
            except (NoSuchElementException, AttributeError) as e:
                logging.warning(e)
                break
        #se set the old length equal to the new length to start again the scroll down process.
        last_length = new_length
    #When there are not more replies neither any 'Show nore replies' link.
    #We click in all the intermediate links 'x replies more'. Sometimes people replies to the replies and this other replies
    #are not always showed at first.
    try:
        links_replies = driver.find_elements_by_css_selector('.ThreadedConversation-moreRepliesLink')
        for link in links_replies:
            link.click()
    except (NoSuchElementException, AttributeError) as e:
        logging.warning(e)
    #return the driver object with the complete page.
    return driver

* This function searchs for information inside the page stored in driver and it saves the content in the dataframe `df_scrapping_replies`.

In [51]:
def analize_page(driver, conversation_id):
    
    '''
    It searchs for information inside the page stored in driver and it saves the content in the dataframe df_scrapping_replies.
    
    Args:
        driver: selenium.webdriver object. Used to get the page, to move on it and make actions.
        (str) conversation_id: status or tweet id for which we want to extract the data.
    
    Return:
    
        df_scrapping_replies: pandas.dataframe object with all the data gathered.
    '''
    
    #it configures the file for the log.
    logging.basicConfig(filename='scrapping_replies.log',level=logging.DEBUG)
    #it initializes the dataframe df_scrapping_replies
    df_scrapping_replies = pd.DataFrame()
    #call BeautifulSoup with the driver page to be decoded
    soup = BeautifulSoup(driver.page_source, "html.parser")
    #load the content in a list of replies
    tweets = soup.find_all('li','js-stream-item')

    for tweet in tweets:
        #analyze each reply one by one
        try:
            #the status id of the reply
            reply_id = tweet.get('data-item-id')
            full_name = ""
            #we get only the text part of the name. There is other parts like emojis.
            full_names = tweet.find("span", "FullNameGroup").find("strong", "fullname").contents
            for name in full_names:
                if isinstance(name,  str):
                    full_name = full_name + name
                    
            #the name with @
            user_name = tweet.find("span", "username").find("b").contents[0].strip()
            #the id of the user.
            user_id = tweet.find("div",class_=re.compile("^tweet js-stream-tweet")).get('data-user-id')
            logging.debug(user_id)
            #The number of replies to this reply, number ot retweets and faver of this reply.
            replies = tweet.find("span", id=re.compile("^profile-tweet-action-reply-count")).contents[0]
            retweets = tweet.find("span", id=re.compile("^profile-tweet-action-retweet-count")).contents[0]
            favs = tweet.find("span", id=re.compile("^profile-tweet-action-favorite-count")).contents[0]
            #The language of the message if it is configured.
            language = tweet.find("p", "TweetTextSize js-tweet-text tweet-text").get('lang')
            #In the text part we only get the str part. We discard emojis and other things.
            texts = tweet.find("p", "TweetTextSize js-tweet-text tweet-text").contents
            text = ""
            ref = ""
            image = ""
            for subtext in texts:
                if isinstance(subtext,  str):
                    text = text + subtext
            ref_aux = tweet.find_all("a", "pretty-link js-user-profile-link")
            #the other users id that are referred in the text of the reply
            for subref in ref_aux:
                subrefs = subref.get('data-user-id')
                ref.append(subrefs)
            #We try to find if there is some image attached. If it is so, we save the url of the picture.
            try:
                image = tweet.find("div", "AdaptiveMedia-photoContainer js-adaptive-photo").get('data-image-url') 
            except AttributeError as e:
                logging.warning(e)
            #we also get the date_time of the reply
            timestamp = tweet.find("small", "time").find("span", "_timestamp js-short-timestamp").get('data-time')
            timestamp = str(datetime.fromtimestamp(int(timestamp)))
            
            #se save all in the dataframe df_scrapping_replies 
            df_scrapping_replies = df_scrapping_replies.append({'timestamp': timestamp,
                                                                'conversation': conversation_id,
                                                                'reply_id': reply_id,
                                                                'full_name': full_name,
                                                                'user_name': '@' + user_name,
                                                                'user_id': user_id,
                                                                'image': image,
                                                                'replies': int(replies.split()[0].replace('.','')),
                                                                'retweets': int(retweets.split()[0].replace('.','')),
                                                                'favs': int(favs.split()[0].replace('.','')),
                                                                'text': text,
                                                                'language': language,
                                                                'references': ref
                                                                },ignore_index=True)

        except AttributeError as e:
            logging.warning(e)
    
    #It reurns df_scrapping_replies 
    return df_scrapping_replies
    

* We call the neccessary functions to make the scrap. As it takes a long time, each time that we search the replies of a tweet we save them in the csv file. Instead of saving the complete file at the end. If there were any problem we would have save the results until this moment.

In [24]:
=================================================
REMOVE THIS TO EXECUTE IT. IT CAN TAKES SEVERAL HOURS TO FINISH
=================================================
start = timer()
#initialize the log file
open('scrapping_replies.log', 'w').close()
logging.basicConfig(filename='scrapping_replies.log',level=logging.DEBUG)

logging.debug("********************** CONNECTING DRIVER TO PAGE **********************")
#initialize the driver.
twapi = connect_twitter()
driver = webdriver.Firefox()
logging.debug("********************** CONNECTION DONE!!!! **********************")
user_name = 'dog_rates'
i = 1
try:
    #search for each tweet id in df_tweepy_extractions (the tweets that we know that are available in this moment)
    for tweet_id in df_tweepy_extractions['tweet_id']:
        #call the download_page function to download the complete page
        logging.debug("********************** START DOWNLOAD: " + tweet_id + "(" + str(i) + ")" + ' **********************')
        driver = download_page(driver, user_name, tweet_id)
        logging.debug("********************** END DOWNLOAD: " + tweet_id + "(" + str(i) + ")" + ' **********************')
        #call the analize_page function to extract a dataframe with the results
        logging.debug("********************** START ANALYSIS: " + tweet_id + "(" + str(i) + ")" + ' **********************')
        df_scrapping_replies = analize_page(driver, tweet_id)
        #The first time that we save the resutls in scrapped_replies.csv we open the file in write mode and we write a header.
        #The next times we open the file in append mode and we don't write the header.
        if i == 1:
            df_scrapping_replies.to_csv(os.path.join(folder_name, 'scrapped_replies.csv'), mode='w', encoding='utf-8', index=False)
        else:
            df_scrapping_replies.to_csv(os.path.join(folder_name, 'scrapped_replies.csv'), mode='a', encoding='utf-8', index=False, header = False)
        end = timer()
        #each time we save a result. we register in the log file the number of tweet_id analyzed and the time consumed.
        logging.debug("********************** END ANALYSIS: " + tweet_id + "(" + str(i) + ")" + str(start-end) + ' **********************')
        i += 1
except Exception as e:
    logging.warning(e)
finally:
    #close and disconnect the driver.
    driver.quit()


SyntaxError: invalid syntax (<ipython-input-24-a10d49617378>, line 1)

* We load the data saved in `scrapped_replies.csv` in the steps before to the dataframe `df_scrapped_replies`. We can see that there are 23 tweets with no replies. We have 2310 rows in df_scrapped_replies and 2333 in df_tweepy_extractions.

In [20]:
df_scrapped_replies = pd.read_csv(os.path.join(folder_name, 'scrapped_replies.csv'))
df_scrapped_replies.conversation.nunique()

2310

In [29]:
df_tweepy_extractions.shape

(2333, 14)

In [31]:
#df_scrapped_replies.conversation = df_scrapped_replies.conversation.astype(str)

These are the list of tweets without any replies.

In [36]:
df_tweepy_extractions[~df_tweepy_extractions.tweet_id.isin(df_scrapped_replies.conversation)].tweet_id

1941    673350198937153538
2054    670833812859932673
2063    670803562457407488
2184    668627278264475648
2190    668567822092664832
2193    668537837512433665
2198    668480044826800133
2201    668291999406125056
2241    667538891197542400
2245    667517642048163840
2257    667393430834667520
2264    667177989038297088
2286    666804364988780544
2289    666776908487630848
2292    666691418707132416
2302    666418789513326592
2304    666407126856765440
2316    666102155909144576
2318    666094000022159362
2320    666073100786774016
2325    666055525042405380
2327    666050758794694657
2331    666029285002620928
Name: tweet_id, dtype: object

## Assesing The Data

* `df_twitter_archive_enhanced`

In [137]:
df_twitter_archive_enhanced.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2011,672245253877968896,,,2015-12-03 02:45:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Snickers. He's adorable. Also comes in t-...,,,,https://twitter.com/dog_rates/status/672245253...,12,10,Snickers,,,,
2239,667937095915278337,,,2015-11-21 05:26:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This dog resembles a baked potato. Bed looks u...,,,,https://twitter.com/dog_rates/status/667937095...,3,10,,,,,
706,785533386513321988,,,2016-10-10 17:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Dallas. Her tongue is ridiculous. 11/1...,,,,https://twitter.com/dog_rates/status/785533386...,11,10,Dallas,,,,
428,821149554670182400,,,2017-01-17 00:18:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Luca. He got caught howling. H*ckin em...,,,,https://twitter.com/dog_rates/status/821149554...,12,10,Luca,,,,
1872,675146535592706048,,,2015-12-11 02:54:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Coops. He's yelling at the carpet. Not...,,,,https://twitter.com/dog_rates/status/675146535...,7,10,Coops,,,,


In [71]:
df_twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [105]:
#Text for the retweeted tweets.
df_twitter_archive_enhanced.query('retweeted_status_id.notnull() == True').text

19      RT @dog_rates: This is Canela. She attempted s...
32      RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...
36      RT @dog_rates: This is Lilly. She just paralle...
68      RT @dog_rates: This is Emmy. She was adopted t...
73      RT @dog_rates: Meet Shadow. In an attempt to r...
                              ...                        
1023    RT @dog_rates: This is Shaggy. He knows exactl...
1043    RT @dog_rates: Extremely intelligent dog here....
1242    RT @twitter: @dog_rates Awesome Tweet! 12/10. ...
2259    RT @dogratingrating: Exceptional talent. Origi...
2260    RT @dogratingrating: Unoriginal idea. Blatant ...
Name: text, Length: 181, dtype: object

In [109]:
#unique numerators and denominators.
df_twitter_archive_enhanced.rating_numerator.unique(), df_twitter_archive_enhanced.rating_denominator.unique()

(array([  13,   12,   14,    5,   17,   11,   10,  420,  666,    6,   15,
         182,  960,    0,   75,    7,   84,    9,   24,    8,    1,   27,
           3,    4,  165, 1776,  204,   50,   99,   80,   45,   60,   44,
         143,  121,   20,   26,    2,  144,   88], dtype=int64),
 array([ 10,   0,  15,  70,   7,  11, 150, 170,  20,  50,  90,  80,  40,
        130, 110,  16, 120,   2], dtype=int64))

In [117]:
#numerator when the denominator = 10
df_twitter_archive_enhanced.query('rating_denominator == 10').rating_numerator.unique()

array([  13,   12,   14,    5,   17,   11,   10,  420,  666,    6,   15,
        182,    0,   75,    7,    9,    8,    1,   27,    3,    4, 1776,
         26,    2], dtype=int64)

In [132]:
#names with length smaller than 4 chars
df_twitter_archive_enhanced.query('name.str.len() < 4').name.unique()

array(['Jax', 'Ted', 'Jim', 'Gus', 'Rey', 'a', 'Aja', 'Jed', 'Leo', 'Ken',
       'Max', 'Ava', 'Eli', 'Ash', 'not', 'Mia', 'one', 'Ike', 'Mo', 'Bo',
       'Tom', 'Alf', 'Sky', 'Tyr', 'Moe', 'Sam', 'Ito', 'Doc', 'mad',
       'Jay', 'Mya', 'an', 'O', 'Al', 'Lou', 'my', 'Eve', 'Dex', 'Ace',
       'Zoe', 'Blu', 'his', 'all', 'Sid', 'old', 'Ole', 'Bob', 'the',
       'Obi', 'by', 'Evy', 'Tug', 'Jeb', 'Dot', 'Mac', 'Ed', 'Taz', 'Cal',
       'JD', 'Pip', 'Amy', 'Gin', 'Edd', 'Ben', 'Dug', 'Jo', 'Ron', 'Stu'],
      dtype=object)

In [135]:
#text for dogs with name 'O'
df_twitter_archive_enhanced.query('name == "O"').text

775    This is O'Malley. That is how he sleeps. Doesn...
Name: text, dtype: object

In [138]:
#text for dogs with name 'by'
df_twitter_archive_enhanced.query('name == "by"').text

1724    This is by far the most coordinated series of ...
Name: text, dtype: object

In [173]:
#sum of number of dogs for each clasification
t = (df_twitter_archive_enhanced.query('doggo != "None"').tweet_id.count(),
df_twitter_archive_enhanced.query('floofer != "None"').floofer.count(),
df_twitter_archive_enhanced.query('pupper != "None"').pupper.count(),
df_twitter_archive_enhanced.query('puppo != "None"').puppo.count())
sum(t)

394

In [164]:
#number of dogs with at least one clasification.
df_twitter_archive_enhanced[df_twitter_archive_enhanced['doggo'].str.contains("doggo") | 
                            df_twitter_archive_enhanced['floofer'].str.contains("floofer") |
                            df_twitter_archive_enhanced['pupper'].str.contains("pupper") |
                            df_twitter_archive_enhanced['puppo'].str.contains("puppo")].shape

(380, 17)

In [192]:
#tweets with two clasifications
df_twitter_archive_enhanced[
    df_twitter_archive_enhanced[["doggo","floofer","pupper","puppo"]].
                            isin(["doggo","floofer","pupper","puppo"]).sum(axis=1)> 1][['tweet_id',"doggo","floofer","pupper","puppo"]]

Unnamed: 0,tweet_id,doggo,floofer,pupper,puppo
191,855851453814013952,doggo,,,puppo
200,854010172552949760,doggo,floofer,,
460,817777686764523521,doggo,,pupper,
531,808106460588765185,doggo,,pupper,
565,802265048156610565,doggo,,pupper,
575,801115127852503040,doggo,,pupper,
705,785639753186217984,doggo,,pupper,
733,781308096455073793,doggo,,pupper,
778,775898661951791106,doggo,,pupper,
822,770093767776997377,doggo,,pupper,


* `df_image_predictions`

In [193]:
df_image_predictions.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1425,772193107915964416,https://pbs.twimg.com/media/Crdhh_1XEAAHKHi.jpg,1,Pembroke,0.367945,True,Chihuahua,0.223522,True,Pekinese,0.164871,True
1028,710997087345876993,https://pbs.twimg.com/media/Cd34FClUMAAnvGP.jpg,1,malamute,0.28126,True,Eskimo_dog,0.232641,True,Pembroke,0.091602,True
1227,745314880350101504,https://pbs.twimg.com/media/Clfj6RYWMAAFAOW.jpg,2,ice_bear,0.807762,False,great_white_shark,0.02704,False,fountain,0.022052,False
886,699036661657767936,https://pbs.twimg.com/media/CbN6IW4UYAAyVDA.jpg,1,Chihuahua,0.222943,True,toyshop,0.179938,False,Weimaraner,0.163033,True
1953,863907417377173506,https://pbs.twimg.com/media/C_03NPeUQAAgrMl.jpg,1,marmot,0.358828,False,meerkat,0.174703,False,weasel,0.123485,False


In [194]:
df_image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [208]:
#tweets that have not been predicted as a dog in any of the three predictions.
df_image_predictions.query('p1_dog == False & p2_dog == False & p3_dog == False')[['img_num',
                                                                                   'tweet_id',
                                                                                   'p1','p1_dog',
                                                                                   'p2','p2_dog',
                                                                                   'p3','p3_dog']]

Unnamed: 0,img_num,tweet_id,p1,p1_dog,p2,p2_dog,p3,p3_dog
6,1,666051853826850816,box_turtle,False,mud_turtle,False,terrapin,False
17,1,666104133288665088,hen,False,cock,False,partridge,False
18,1,666268910803644416,desktop_computer,False,desk,False,bookcase,False
21,1,666293911632134144,three-toed_sloth,False,otter,False,great_grey_owl,False
25,1,666362758909284353,guinea_pig,False,skunk,False,hamster,False
...,...,...,...,...,...,...,...,...
2021,1,880935762899988482,street_sign,False,umbrella,False,traffic_light,False
2022,1,881268444196462592,tusker,False,Indian_elephant,False,ibex,False
2046,1,886680336477933568,convertible,False,sports_car,False,car_wheel,False
2052,1,887517139158093824,limousine,False,tow_truck,False,shopping_cart,False


* `df_tweepy_extractions`

In [249]:
df_tweepy_extractions.sample(10)

Unnamed: 0,entities_name,entities_screen_name,entities_type,entities_user_id,favorite_count,favorites_count_retweet,mentions_name,mentions_screen_name,mentions_user_id,quoted_status_id,quoted_status_id_rwetweet,retweet_count,retweet_count_retweet,tweet_id
1350,[johnny yuen],[johnny167167],[user],[393334099],4876.0,,,,,,,1361.0,,702217446468493312
1214,,,,,5411.0,,,,,,,1584.0,,712438159032893441
98,,,,,14144.0,,,,,,,3493.0,,872820683541237760
1590,[Dan Morrow],[DanielAMorrow],[user],[71047424],2700.0,,,,,,,656.0,,685321586178670592
1511,,,,,1452.0,,,,,,,378.0,,689999384604450816
638,,,,,0.0,9775.0,[WeRateDogs®],[WeRateDogs®],[4196983835],,,3986.0,3986.0,791780927877898241
2203,,,,,809.0,,,,,,,227.0,,668274247790391296
2017,,,,,1945.0,,,,,,,1030.0,,671544874165002241
453,[Kelli Adrian],[kadrian_15],[user],[2449586958],0.0,17348.0,[WeRateDogs®],[WeRateDogs®],[4196983835],,,4907.0,4907.0,816829038950027264
1434,[madison],[_MaddyElizabeth],[user],[620042575],3704.0,,,,,,,1478.0,,695314793360662529


In [237]:
df_tweepy_extractions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2333 entries, 0 to 2332
Data columns (total 14 columns):
entities_name                2333 non-null object
entities_screen_name         2333 non-null object
entities_type                2333 non-null object
entities_user_id             2333 non-null object
favorite_count               2333 non-null float64
favorites_count_retweet      2333 non-null object
mentions_name                2333 non-null object
mentions_screen_name         2333 non-null object
mentions_user_id             2333 non-null object
quoted_status_id             2333 non-null object
quoted_status_id_rwetweet    2333 non-null object
retweet_count                2333 non-null float64
retweet_count_retweet        2333 non-null object
tweet_id                     2333 non-null object
dtypes: float64(2), object(12)
memory usage: 255.3+ KB


In [243]:
df_tweepy_extractions.query('retweet_count_retweet != ""') 

Unnamed: 0,entities_name,entities_screen_name,entities_type,entities_user_id,favorite_count,favorites_count_retweet,mentions_name,mentions_screen_name,mentions_user_id,quoted_status_id,quoted_status_id_rwetweet,retweet_count,retweet_count_retweet,tweet_id
31,,,,,0.0,1482,[Oakland A's],[Oakland A's],[19607400],886053434075471873,886053434075471873,101.0,101,886054160059072513
35,[Alexandra Gibson],[Chappee_98],[user],[437072817],0.0,68612,[WeRateDogs®],[WeRateDogs®],[4196983835],,,17287.0,17287,885311592912609280
67,,,,,0.0,40266,[WeRateDogs®],[WeRateDogs®],[4196983835],,,6358.0,6358,879130579576475649
72,,,,,0.0,7390,[WeRateDogs®],[WeRateDogs®],[4196983835],,,1203.0,1203,878404777348136964
73,,,,,0.0,20543,[WeRateDogs®],[WeRateDogs®],[4196983835],,,6237.0,6237,878316110768087041
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1001,,,,,0.0,2961,[WeRateDogs®],[WeRateDogs®],[4196983835],,,1004.0,1004,746521445350707200
1021,,,,,0.0,4490,[WeRateDogs®],[WeRateDogs®],[4196983835],,,2129.0,2129,743835915802583040
1220,,,,,0.0,929,"[Twitter, WeRateDogs®]","[Twitter, WeRateDogs®]","[783214, 4196983835]",,,128.0,128,711998809858043904
2236,[WeRateDogs®],[dog_rates],[user],[4196983835],0.0,171,[We Rate Dog Ratings],[We Rate Dog Ratings],[4296831739],,,34.0,34,667550904950915073


In [245]:
df_tweepy_extractions.query('retweet_count_retweet != ""').shape

(165, 14)

* `df_scrapped_replies`

In [248]:
df_scrapped_replies.sample(10)

Unnamed: 0,conversation,favs,full_name,image,language,references,replies,reply_id,retweets,text,timestamp,user_id,user_name
6068,882762694511734784,0.0,Mayhem Kevin™,,en,['4196983835'],0.0,884544289853845504,0.0,Cure pup,2017-07-11 00:46:07,837835939,@Mayham_Kevin
51053,724983749226668032,1.0,Katie McCarty,,und,['4196983835'],1.0,725043998922952704,0.0,,2016-04-26 21:28:54,2264927300,@katiemccarty_9
10423,875021211251597312,2.0,shaunie,,en,"['517724675', '4196983835', '398156819']",0.0,875024484000047105,0.0,"top marks from me, did me a spook x",2017-06-14 18:17:48,2400085956,@peace0fshit
21417,831911600680497154,0.0,Oli Whites Bae,,en,"['4196983835', '239715983']",0.0,831916716447891456,0.0,yes! Well done those who saved him im so happy !,2017-02-15 18:22:56,831587130312970240,@MillieJ6941
60320,687732144991551489,2.0,abby hyde,,und,['2863626632'],0.0,687820335345995776,0.0,AW,2016-01-15 03:15:21,2740380257,@itsabbyhyde
64802,678021115718029313,2.0,cassidy,,en,['370393849'],0.0,678446914342289409,0.0,I love him,2015-12-20 06:28:43,323349940,@casbeal
45109,756651752796094464,1.0,Sarah Busic,,en,"['1005260233', '19946206', '4196983835']",0.0,757293591634903041,0.0,amazing,2016-07-24 21:17:16,394930362,@Sarahbusic4
57059,698703483621523456,1.0,mads,,en,['35832263'],0.0,699385125977587712,0.0,@huyanaphour this is true,2016-02-16 01:09:42,2159234231,@madisongoebel13
35238,796116448414461957,0.0,Abielle c(__),,en,"['379440123', '4196983835', '34148664']",1.0,796118333431091200,0.0,Someone looks like they're totally over the el...,2016-11-08 23:32:56,122664841,@AbielleRose
25060,819015331746349057,0.0,Toad,,en,['4196983835'],0.0,819007295166304256,0.0,isn't she technically the second doggi,2017-01-11 03:25:30,1544737268,@TOADFROGTHING


In [247]:
df_scrapped_replies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72909 entries, 0 to 72908
Data columns (total 13 columns):
conversation    72909 non-null object
favs            72909 non-null float64
full_name       72456 non-null object
image           3864 non-null object
language        72909 non-null object
references      72909 non-null object
replies         72909 non-null float64
reply_id        72909 non-null int64
retweets        72909 non-null float64
text            65534 non-null object
timestamp       72909 non-null object
user_id         72909 non-null int64
user_name       72909 non-null object
dtypes: float64(3), int64(2), object(8)
memory usage: 7.2+ MB


#### Quality

##### `df_twitter_archive_enhanced` table:

- The type of the columns `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` should be string.
- `doggo`,`floofer`,`pupper` and `puppo` columns should be categorical.
- We are not intereste in the first par of the text (RT @XXXX:) when the tweet is a retweet. I already have this information in other columns.
- There are 23 rows with rating_denominator different to 10.
- From those rows with a denominator equal to 10, some have a numerator not very realistic.
- The name of the dogs `a`, `O`, `by`, `an`, `the`, `his`, `all`and `my` are incorrect. 
- There are not many dogs classified as doggo, floofer, etc. And 14 of them have double clasification.


##### `df_image_predictions` table:

- The type of the column `tweet_id` should be string.


##### `df_tweepy_extractions` table:

- The type of the columns `favorites_count_retweet` and `retweet_count_retweet` should be integer. Anyway the column `retweet_count_retweet` has no sense because it has the same value as `retweet_count`.
- Nulls represented as void strings in `entities_name`,	`entities_screen_name`, `entities_type`, `entities_user_id`, `favorites_count_retweet`,	`entions_name`,	`mentions_screen_name`,	`mentions_user_id`,	`quoted_status_id`,	`quoted_status_id_rwetweet` and `retweet_count_retweet`.


##### `df_scrapped_replies` table:

- The type of the columns `favs`, `replies` and `retweets` should be integer instead of float.
- The type of the columns `user_id` and `reply_id` should be a string.
- `language` type should be categorical.
	




#### Tidiness