# INF8111 - Data Mining


## TP2 Autumn 2019 - Extraction and analysis of a tweets database

##### Team members:

    - Jean-Romain Roy (1720495)
    - Pankaj Raj Roy (Matricule)

## Overview


In 2017, Twitter had 313 million active users per month, with 500 million tweets sent per day. This data is made available to researchers and developers through a public API, which lets you collect the information you need.

Nevertheless, Twitter's developer policy limits the sharing of this data. Sharing the content of tweets in a database is not allowed; only tweet identifiers may be shared. A publicly shared tweet database must therefore contain only tweet identifiers, and this is what most public datasets provide.

It is therefore necessary to "hydrate" the tweets in question, i.e. retrieve all the information associated with each ID, which requires using the Twitter API.

We will use here public databases created by GWU (George Washington University), which have the advantage of being very recent:
https://dataverse.harvard.edu/dataverse/gwu-libraries

Each GWU database covers a specific topic (2016 U.S. election, Olympic Games, etc.), and the data was collected by applying queries that filtered the results to have only relevant tweets. A README file is provided with each database to give details of the creation of the *dataset*. 

**Thus, the objectives of this TP are as follows:**

 1. Build a crawler that collects tweet information from tweet IDs, using the dataset of your choice and the information relevant to the chosen subject.
 2. Apply Machine Learning (ML) / Natural Language Processing (NLP) methods to the collected Twitter data to provide a relevant analysis.

Since Twitter allows the **local** sharing of data (for example within a research group), a database will be provided if you are unable to create your own.

# I/ Hydration of tweets using the Twitter API (4 Pts)

### 1. Getting Twitter authorization to use the API

For authentication, Twitter uses OAuth: https://developer.twitter.com/en/docs/basics/authentication/overview/oauth
You will need OAuth2 in particular here, because you will not interact with users on Twitter (you will simply collect data).
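
A minimal sketch of app-only (OAuth 2) authentication with Twython, following the two-step flow described in the Twython documentation. `make_app_only_client` is a hypothetical helper name, and its two arguments are assumed to be the consumer key and secret obtained in section 1.2. The function needs network access and valid credentials, so it is only defined here, not called.

```python
def make_app_only_client(consumer_key, consumer_secret):
    """Return a Twython client authenticated with an OAuth 2 bearer token."""
    # Imported lazily so that defining the helper does not require twython.
    from twython import Twython

    # Step 1: exchange the key pair for a bearer token.
    bootstrap = Twython(consumer_key, consumer_secret, oauth_version=2)
    access_token = bootstrap.obtain_access_token()

    # Step 2: build the client that sends the bearer token with each request.
    return Twython(consumer_key, access_token=access_token)
```

With the credentials module used in this notebook, the call would presumably look like `make_app_only_client(CONSUMER_KEY, CONSUMER_SECRET)`.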

##### 1.1. Obtaining a developer account

 The first step is to register your application and create a Twitter developer account. To do this:

 - Create a basic Twitter account (if you don't have one)
 
 - On the website https://developer.twitter.com, click on *apply* to obtain a developer account. 
 
 - Fill in all the necessary fields. Twitter asks for many details about how you will use this account, so explain your approach in detail: underline that the project is **academic** (no commercial intention, no publication of the data collected, etc.), explain the objectives and learning outcomes of this TP (familiarisation with the Twitter API, the concrete application of Data Mining methods, etc.), and describe precisely what you will do with the data, the methods you will apply, the report provided, etc.  If you are not specific enough, Twitter may send you an email to ask for clarification. 


##### 1.2. Obtaining access tokens 

 - When Twitter has validated your developer account request, go to https://developer.twitter.com/en/apps to create an app.

- Again, some information must be provided. Some fields, such as the name or the website, are not very important; you can enter a placeholder website if you want.

- At the end of this process, you can finally get the keys and tokens to use the API: go to the application page to create the tokens. You must retrieve a key pair and a pair of tokens to move on.


In [1]:
from credentials import *

###  2. First steps with Twython

##### 2.1 Installation and import of the library


Several Python libraries exist to work with the Twitter API. Also called wrappers, they are sets of Python functions that call the API endpoints. Among them, we will use Twython, a widespread and actively maintained library.

Twython documentation : https://twython.readthedocs.io/en/latest/api.html 

In [2]:
import csv
import time
import sys

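try:
    from twython import Twython, TwythonError, TwythonRateLimitError
except ImportError:
    # Install Twython, then retry the import so the names are defined in this session
    !pip install twython --user
    from twython import Twython, TwythonError, TwythonRateLimitError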

##### 2.2 Creation of an app and first test:

In [3]:
twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, oauth_token, oauth_secret)

## Import Custom Libs

In [4]:
from libs.states import states
from libs.states import findState as extract_place

import libs.preprocessor as tweet_preproc

In [5]:
# Init Preprocessor
twitterPreprocessor = tweet_preproc.TwitterPreprocessor()

## Helper Functions

In [6]:
# Number of rows in a file
def file_len(fname):
    
    # Count the lines one by one; an empty file gives 0 instead of raising an error
    with open(fname) as f:
        nbrOfLines = sum(1 for _ in f)
    
    print("Nbr of lines : " + str(nbrOfLines))
    
    return nbrOfLines

In [7]:
# Concatenate multiple text documents
def concatDocuments(input_names,output_name):
    
    with open(output_name, 'w') as outfile:
        for fname in input_names:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)

#### Question 1. Display the date, username and content of the tweet with the ID : 1157345692517634049

*Hint : you can use the twython function `show_status`*

In [8]:
test_id = "1157345692517634049"

status = twitter.show_status(id=test_id)

In [9]:
print(str(status['user']['name']) + " - " + str(status['created_at']) + "\n")
print(str(status['text']))

Donald J. Trump - Fri Aug 02 17:41:30 +0000 2019

A$AP Rocky released from prison and on his way home to the United States from Sweden. It was a Rocky Week, get home ASAP A$AP!


**Warning:** Twitter limits the number of requests per 15-minute window, which must be taken into account when building the database: https://developer.twitter.com/en/docs/basics/rate-limiting.html

### 3. Hydration of a tweets database

The serious business is starting!

We now want to build a `hydrate_database` function that, from a text file containing a list of tweet IDs, creates a csv file containing the information we want to extract.

Because of the request limit, the `show_status` function seen above is not well suited to this task: at 900 requests per 15 minutes, it would take far too long to build an interesting database. The `lookup_status` function (see documentation) is more suitable: it hydrates up to 100 tweets per request, which, with a limit of 900 requests per 15 minutes, makes building the database more realistic. You will still need to handle the error raised when the limit is reached, if you want more than 90000 tweets or if you call the function several times in less than 15 minutes.
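
The arithmetic above can be sketched as follows. This is an illustrative helper, not part of the expected answer: `chunk_ids` and `estimated_minutes` are hypothetical names, and the estimate assumes that only the rate-limit waits matter (network time is ignored) and uses the figures quoted above (100 IDs per request, 900 requests per 15 minutes).

```python
import math

def chunk_ids(ids, size=100):
    """Yield successive bundles of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def estimated_minutes(nb_ids, ids_per_request=100,
                      requests_per_window=900, window_minutes=15):
    """Rough lower bound on hydration time, counting only rate-limit waits."""
    nb_requests = math.ceil(nb_ids / ids_per_request)
    # Each completed window of requests forces one wait before the next request.
    full_windows = (nb_requests - 1) // requests_per_window
    return full_windows * window_minutes

bundles = list(chunk_ids([str(i) for i in range(250)]))
print([len(b) for b in bundles])    # [100, 100, 50]
print(estimated_minutes(90000))     # 0: fits in a single window
print(estimated_minutes(50000000))  # 8325 minutes, about 5.8 days
```

Note that the last bundle may contain fewer than 100 IDs and still has to be sent.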

#### Question 2. Implement the method `hydrate_database`

*Warning : It is also necessary to manage the case where the requested feature is not a dictionary key but a subkey, as is the case for the user name for example.*
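
One possible way to handle keys and subkeys uniformly is to describe each feature as a list of keys and follow it step by step. The sketch below is an assumption about how this could be done; `get_feature` is a hypothetical helper, and the sample dictionary only imitates the shape of a tweet.

```python
def get_feature(status, path):
    """Follow a list of keys into a nested dict; return None if a key is missing."""
    value = status
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

status = {"text": "hello", "user": {"screen_name": "someone"}}
print(get_feature(status, ["text"]))                 # hello
print(get_feature(status, ["user", "screen_name"]))  # someone
print(get_feature(status, ["user", "location"]))     # None
```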

*Hint: The `sleep` function of the `time` library lets you wait for the desired time.*
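
A minimal sketch of the wait-and-retry idea, written without Twython so that it runs on its own. `RateLimitReached` is a stand-in exception and `call_with_retry` a hypothetical helper; with Twython, the `except` clause would presumably catch `TwythonRateLimitError` instead. The point of the sketch is that the same request is sent again after the wait, so no bundle of IDs is lost.

```python
import time

class RateLimitReached(Exception):
    """Stand-in for the wrapper's rate-limit error (hypothetical)."""

def call_with_retry(request, wait_seconds=930, max_attempts=3, sleep=time.sleep):
    """Call `request()`; on a rate-limit error, wait and send the same request again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimitReached:
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)

# Demo with a fake request that fails once, and a no-op sleep so it runs instantly.
calls = {"n": 0}

def fake_request():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitReached()
    return ["tweet"]

print(call_with_retry(fake_request, sleep=lambda seconds: None))  # ['tweet']
```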

In [10]:
def hydrate_database(filename, database_name, features, 
                     checkLocation=True,
                     nb_requests=899, 
                     tweet_hydratation_limit=100,
                     start_index=0):
    """
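    Create a csv file that contains features of tweets from a file that contains tweet IDs.
    
    filename: Name of the file that contains ids
    database_name: name of the file that will be created
    features: List of features; each one is a list holding a key, or a key and a subkey
    checkLocation: if True, skip tweets whose user location cannot be matched by extract_place
    nb_requests: number of times the function lookup_status will be called
    tweet_hydratation_limit: Ids bundle length for each request
    start_index: row of the ID file at which hydration starts (earlier rows are skipped)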
    
    """
    
    # Count number of lines in file
    linesCount = file_len(filename)
    fracOfLines = round(10000.0*(nb_requests*tweet_hydratation_limit)/(1.0*linesCount))/100.0
    print("Fraction about to be Hydrated : " + str(fracOfLines))

    # Opening the ID File:
    file = open(filename, "r")
    
    # Creation of the file that will contain the hydrated tweets:
    with open(database_name, 'w+', newline='', encoding="utf-8") as csvfile:
        
        # First line write the headers
        headers = []
        for feature in features:
            if(len(feature) == 1):
                headers.append(feature[0])
            elif(len(feature) == 2):
                headers.append(feature[1])
        
        headers = ','.join(headers) + '\n'
        csvfile.write(headers)
        
        ids = []
        row_counter = 0
        req_counter = 0
        for row in file:
            
            # increment row counter
            row_counter = row_counter + 1

            # Pagination skip
            if(row_counter < start_index):
                continue
            
            # Strip
            row = row.strip()
            
            # Append the id on the row
            ids.append(row)
            

            # If we have our bundle
            if((len(ids) % tweet_hydratation_limit) == 0):
                
                # increment request counter
                req_counter = req_counter + 1

                # We are done
                if(req_counter > nb_requests):
                    break
                
                # Convert to comma separated
                ids = list(set(ids))
                ids_csv = ",".join(ids)
                
                # Clear array
                ids = []

                try: # If you don't reach the limit of requests
                    
                    # Send Request for all the ids
                    all_status = twitter.lookup_status(id=ids_csv)
                    
                    # Go through individual status
                    for status in all_status:
                        
                        # init new line
                        newline = []
                        
                        # go through features
                        featureExtracted = True
                        for feature in features:
                            
                            if(len(feature) == 1):
                                
                                # Get datum
                                datum = str(status[feature[0]]).strip()
                                
                                if(feature[0] == 'text'):
                                
                                    # Clean
                                    datum = twitterPreprocessor.clean(datum)
                                    
                                elif(feature[0] == 'created_at'):
                                    
                                    # Parse Datetime
                                    datum = twitterPreprocessor.to_datetime(datum)
                                    
                                # Add to line
                                newline.append(datum)
                                
                            elif(len(feature) == 2):
                                
                                # Get the datum
                                datum = str(status[feature[0]][feature[1]]).strip()
                                
                                if(feature[1] == 'description'):
                                    
                                    # Clean
                                    datum = twitterPreprocessor.clean(datum)
                                    
                                elif(feature[1] == 'location' and checkLocation == True):
                                    
                                    # datum is a string here (str() was applied above),
                                    # so a missing location shows up as '' or 'None'
                                    if(len(datum) == 0 or datum == 'None'):
                                        featureExtracted = False
                                    else:
                                        datum = extract_place(datum)
                                        if(datum == None):
                                            featureExtracted = False
                                
                                    if(featureExtracted == False):
                                        break
                                 
                                # Add to line
                                newline.append(datum)
                         
                        # Check if all features were extracted
                        if(featureExtracted == True):
                            
                            # Write to output file, doubling embedded quotes so the CSV stays valid
                            newline = '"' + '","'.join(str(d).replace('"', '""') for d in newline) + '"\n'
                            csvfile.write(newline)
                    

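                except TwythonRateLimitError:
                    # Rate limit reached: wait for the window to reset.
                    # Note: the current bundle is not retried after the wait.
                    print("Nbr of requests done : " + str(req_counter))
                    print("sleeping for 15 min and 30 seconds")
                    time.sleep(930)

                except TwythonError as e:
                    # Any other API error: report it and continue with the next bundle
                    print("Twython error : " + str(e))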
                        

        print("Nbr of requests done : " + str(req_counter))

    file.close()
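
As a possible alternative to building each CSV line by hand, the standard `csv` module (already imported above) can take care of quoting. The sketch below uses an in-memory buffer and invented sample rows; it is not the implementation used in this notebook.

```python
import csv
import io

rows = [
    ["id", "text"],
    ["1", 'He said "hi", then left'],
]

# QUOTE_ALL quotes every field; embedded quotes are doubled automatically.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back gives the original rows, commas and quotes included.
buffer.seek(0)
print(list(csv.reader(buffer)) == rows)  # True
```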


Use this file as an example : 
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/5QCCUU/QPYP8G&version=1.1

We only want to keep the content (*text*) and the username (*user/screen_name*)

# Constants

In [11]:
collections = ["sources/general/election-filter1.txt","sources/general/election-filter2.txt","sources/general/election-filter3.txt","sources/general/election-filter4.txt","sources/general/election-filter5.txt","sources/general/election-filter6.txt"]

In [12]:
features = [['id'],['created_at'],['text'],['user','screen_name'],['user', 'description'],['user','location'],['user','followers_count'],['user','statuses_count'],['retweet_count']]

In [13]:
NBR_REQUEST = 898
HYDRATION_LIMIT = 100

# Get Election Data

In [14]:
collection_id = 4
start_index = 28980000
start_date = '2016-11-08_2'

filename = collections[collection_id]
database_name = "data/" + start_date + "_" + str(start_index) + "_election-filter" + str(collection_id+1) + ".csv"

hydrate_database(filename, database_name, features, nb_requests=NBR_REQUEST, tweet_hydratation_limit=HYDRATION_LIMIT, start_index=start_index)

Nbr of lines : 50000000
Fraction about to be Hydrated : 0.18
Nbr of requests done : 899


# Democratic Party and Candidates

In [None]:
filename1 = "sources/parties_candidates/democratic-candidate-timelines.txt"
filename2 = "sources/parties_candidates/democratic-party-timelines.txt"
database_name1 = "data/parties_candidates/democratic/democratic-candidate-timelines.csv"
database_name2 = "data/parties_candidates/democratic/democratic-party-timelines.csv"

# Replace filename1/database_name1 with filename2/database_name2 to hydrate the party timelines
hydrate_database(filename1, database_name1, features, nb_requests=899, tweet_hydratation_limit=100, checkLocation=False)

# Republican Party and Candidates

In [None]:
filename1 = "sources/parties_candidates/republican-candidate-timelines.txt"
filename2 = "sources/parties_candidates/republican-party-timelines.txt"
database_name1 = "data/parties_candidates/republican/republican-candidate-timelines.csv"
database_name2 = "data/parties_candidates/republican/republican-party-timelines.csv"

# Replace filename1/database_name1 with filename2/database_name2 to hydrate the party timelines
hydrate_database(filename1, database_name1, features, nb_requests=899, tweet_hydratation_limit=100, checkLocation=False)

# Merge Documents

In [15]:
from glob import glob

In [16]:
output_name = 'data/general/2016-11-09/newtweets.csv'

# Skip the output file itself, so that re-running the cell does not read the file being written
filenames = [f for f in glob("data/general/2016-11-09/*.csv") if not f.endswith('newtweets.csv')]
#filenames = ['data/2016-11-07_0_25900000_election-filter5.csv','data/2016-11-07_1_25990000_election-filter5.csv','data/2016-11-07_2_26080000_election-filter5.csv']

concatDocuments(filenames,output_name)
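
Since every hydrated file starts with the same header line, a plain concatenation repeats that header in the middle of the merged file. A hedged sketch of a variant that keeps only the first header is given below; `concat_csv` is a hypothetical name, and the sketch assumes that each record sits on a single line (no line breaks inside a tweet's text). The demo runs on small temporary files.

```python
import os
import tempfile

def concat_csv(input_names, output_name):
    """Concatenate CSV files that share a header, writing the header only once."""
    with open(output_name, "w", encoding="utf-8", newline="") as outfile:
        for index, fname in enumerate(input_names):
            with open(fname, encoding="utf-8", newline="") as infile:
                header = infile.readline()
                if index == 0:
                    outfile.write(header)
                for line in infile:
                    outfile.write(line)

# Small demo on temporary files.
tmp_dir = tempfile.mkdtemp()
parts = []
for n, body in enumerate(["1,a\n", "2,b\n"]):
    path = os.path.join(tmp_dir, "part%d.csv" % n)
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("id,text\n" + body)
    parts.append(path)

merged = os.path.join(tmp_dir, "merged.csv")
concat_csv(parts, merged)
with open(merged, encoding="utf-8", newline="") as f:
    merged_text = f.read()
print(merged_text)
```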

# II/ Analysis of a database of your choice  (16 pts)

Now that you are able to hydrate a tweets database efficiently while taking Twitter's limitations into account, you can apply it to the *dataset* that interests you most.

### 1. Instructions

In this section, you will conduct a **full** Data Science project, from data collection to interpretation of the results. You must choose from the following 3 topics:
 
 1. Sentiment analysis for prediction of the results of the American election.

    **Dataset**: "2016 United States Presidential Election Tweet Ids", https://doi.org/10.7910/DVN/PDI7IN  
 
    **Note:** This subject is quite similar to TP1 (here, the sentiment is the political party), so you are free to reuse what you did. However, you should go a little deeper here, for example on the classification stage. In addition, you face a new problem: your data is not labelled (but the way the collections were built should allow you to label it yourself).


 2. Hate speech detection.
    
    **Dataset**: Modify your hydrate function to generate a database with recent tweets, using the `search` function.
 
    **Note:** This subject could also be addressed in the same way as TP1: preprocessing + classification steps. However, obtaining data labelled "hateful"/"not hateful" is much harder in this case: many labelled databases will be almost empty once hydrated, because the tweets will have been deleted by the time you make your request (Twitter also ensures that hateful tweets are deleted). That is why you have to create a database with the most recent tweets possible, before they are potentially deleted. To designate a tweet as hateful, one method is to detect hateful vocabulary, for example with `hatebase.org`, which offers large and comprehensive databases. You can create an account on the website to access the API, and then use this library for Python: https://github.com/DanielJDufour/hatebase. If you modify the query to return only tweets containing this vocabulary, and combine it with sentiment analysis, you can get results to analyze. You could also take a "user" approach to search for hateful tweets: when a tweet is detected as hateful, inspect all of the user's tweets and/or the followers' tweets. To sum up, you have many possibilities, but this subject is probably the most complex one (I will take this into account). I will therefore be less demanding on 'quantified' results; what matters here is the analysis and a coherent approach (it is also very important to take the time to think about a clear definition of 'hateful').


 
 3. Clustering methods applied to tweets on current events, and analysis of results.

    **Dataset**: "News Outlet Tweet Ids", https://doi.org/10.7910/DVN/2FIFLH

    **Note:** Application of preprocessing methods, then clustering methods to group tweets that mention the same news or news category (your choice!), then visualization, evolution over time... You will need to find out which clustering method is best, and this will depend on your approach (is the number of classes known? if yes, how many classes?)
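
For topic 2, the vocabulary-based detection mentioned in the note can be sketched as a simple token match. Everything here is illustrative: `LEXICON` holds placeholder terms (a real list, for example one exported from hatebase.org, would replace it), and `flag_tweet` is a hypothetical helper. A match only marks a tweet as a candidate; deciding whether it is actually hateful still depends on the definition you adopt.

```python
import re

# Placeholder terms only; a real lexicon would replace this hypothetical set.
LEXICON = {"slur_a", "slur_b"}

def flag_tweet(text, lexicon=LEXICON):
    """Return the lexicon terms found in the tweet, matched on whole tokens."""
    tokens = set(re.findall(r"[a-z0-9_']+", text.lower()))
    return sorted(tokens & lexicon)

print(flag_tweet("This contains SLUR_A, twice: slur_a"))  # ['slur_a']
print(flag_tweet("A harmless tweet"))                      # []
```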

You are entirely free on the whole process (choice of extracted information, ML methods, library, etc.). As these topics are popular in the scientific community, you can (**if you wish**) draw inspiration from articles in the literature, provided you cite them in your report and make your own implementation. 


#### The objective here, however, is not to obtain the state of the art, but to apply a clear and rigorous methodology that you have built yourself. 

Since datasets are massive, it is strongly discouraged to make a database containing all hydrated tweets (for example, the authors of database #1 point out that with the limitations of the API it would take you about 32 days). It is up to you to see what size dataset you need.

If you are doing supervised learning: you will need to train a model with a labelled set, and so you have two options available. Either you will have to retrieve labelled data, or you are able to label your data yourself (for example, in the case of subject 1, the database is divided into collections, some of which depend on the political party). You can reuse your TP1 implementation, but you are asked to explore a little more deeply here, especially the classification methods.
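
Labelling by collection can be sketched as follows, assuming that each hydrated CSV file comes from a collection tied to one party. The mapping `COLLECTION_LABELS`, the helper `label_rows` and the sample row are illustrative assumptions, not part of the dataset.

```python
import csv
import io

# Hypothetical mapping from a hydrated collection file to the label of its tweets.
COLLECTION_LABELS = {
    "democratic-candidate-timelines.csv": "democratic",
    "republican-candidate-timelines.csv": "republican",
}

def label_rows(collection_name, csv_text):
    """Attach the collection's label to every row of its CSV content."""
    label = COLLECTION_LABELS[collection_name]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row, label=label) for row in reader]

sample = 'text,screen_name\n"some tweet","someone"\n'
labelled = label_rows("democratic-candidate-timelines.csv", sample)
print(labelled[0]["label"])  # democratic
```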

Also remember to read the README file corresponding to the database you have chosen, in order to help you better understand your future results.

### 2. Report

For this TP, you will have to provide a report that details and justifies your overall method, and provides the results you have obtained. The following elements must appear (this can be used as a plan, but it is not rigid):

- Project title, and name of all team members (with mail and matricule)
    
- **Introduction** : summary of the problem, methodology and results obtained

- **Dataset presentation** : description, justification of size, choice of features, etc. 

- **Preprocessing** : if any, justification of the preprocessing steps  

- **Methodology** : description and justification of all choices (algorithms, hyper-parameters, regularization, metrics, etc.)

- **Results** : analysis of the results obtained (use figures to illustrate), linking design choices to the performance obtained.

- **Discussion** : discuss the advantages and disadvantages of your approach; what are the weaknesses, the flaws? What can be improved? You can also suggest future exploration ideas.

- **References** : if you have been inspired by a study already done.
    
You can use the arXiv template for the report : https://fr.overleaf.com/latex/templates/style-and-template-for-preprints-arxiv-bio-arxiv/fxsnsrzpnvwc. **However, the entire report should not exceed 5 pages, including figures and references.** The 5 pages are not mandatory, if you think that less is enough and your report is indeed complete, you will not be penalized.


### 3. Expected files

At the end of the TP, you will submit a *zip* file containing all of the following elements:

- The pdf file of the report
- This notebook that you will have completed. You can also implement your method further here, or use another file if you wish (the code must be commented and clear).
- Do not send the data files because they are too large. With the report and the code, everything will be detailed and it will be possible to do it again easily.

### 4. Evaluation

75% of the grade (which is 12 points) of this part will be based on the methodology, and 25% (4 points) on the results.

The methodology grade covers: 

- The relevance of all the steps of the approach

- The correct description of the selected algorithms

- The judicious justification of the established choices

- A relevant analysis of the results

- Clarity and organization of the report (figures, tables) and code.


As for the results, it is impossible to set a fixed scale because they will depend on the chosen subject. This is a problem you will face: each problem being specific, it can be complicated to qualitatively evaluate a model, especially since you probably do not know the state of the art. That is why it will be important to do several tests, and to compare different methods. Thus, the results must be consistent with the complexity of your implementation: a simple and naive model will provide you with initial results, which you will then have to improve with more precise and complex models.

Therefore, all points for the results will be given if: 
 - You obtain first results with a naive method that testify to the relevance of your choices 
 - These results are then significantly improved with a more complex method
 - All this is well justified and put into the context of the problem 
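
To make the "naive method first, then improve" progression concrete, a possible starting point is a majority-class baseline against which any later model is compared. The sketch below uses invented toy labels, and `model_predictions` is only a stand-in for the output of a real classifier.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every test item."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions

# Toy labels, invented for illustration only.
y_train = ["rep", "rep", "dem", "rep"]
y_test = ["rep", "dem", "dem", "rep"]
model_predictions = ["rep", "dem", "rep", "rep"]  # stand-in for a real model's output

print(accuracy(y_test, majority_baseline(y_train, len(y_test))))  # 0.5
print(accuracy(y_test, model_predictions))                        # 0.75
```

A model that does not beat this baseline has not learned anything useful from the features, whatever its raw score.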