# Midas Internship Assignment

## Problem 1: Python_Problem 


## Problem Statement: 
Write a Python script that fetches all the tweets (as many as the Twitter API allows) posted by the midas@IIITD Twitter handle and dumps the responses into a JSON Lines file.
The second part of the script should parse this JSON Lines file and display the following for every tweet in a tabular format:
- The text of the tweet.
- Date and time of the tweet.
- The number of favorites/likes.
- The number of retweets.
- The number of images present in the tweet. If there is no image, return None.

### Import the required libraries
- json/jsonlines: for saving and loading the data
- pandas: for table formatting
- tweepy: for accessing Twitter API

In [24]:
import tweepy
from tweepy import OAuthHandler
import json
import jsonlines
import pandas as pd
from pandas import DataFrame

### Authentication for access of Twitter API

- Keys have been removed

In [25]:
consumer_key = '******'
consumer_secret = '******'
access_key = '******'
access_secret = '******'

In [26]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

In [27]:
api = tweepy.API(auth)
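
Instead of pasting the keys into the notebook and masking them afterwards, they could be read from environment variables. The following is only a minimal sketch: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to OAuthHandler and set_access_token in the same way as the hard-coded strings above.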

### Storing the tweets

This function is for storing the tweets in the jsonlines file (output.jsonl)

In [28]:
data = []
def storeTweets(tweet):
    with open('output.jsonl', 'a') as outfile:
        json.dump(tweet,outfile)
        outfile.write('\n')
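
To illustrate the JSON Lines format itself (one JSON object per line), here is a small round trip that uses only the standard library and a temporary file. The records are made up and merely stand in for tweet payloads; the helper names are assumptions for this sketch.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as outfile:
        json.dump(record, outfile)
        outfile.write("\n")


def read_records(path):
    """Read every non-empty line back as a Python object."""
    with open(path, encoding="utf-8") as infile:
        return [json.loads(line) for line in infile if line.strip()]


# Made-up records standing in for tweet payloads.
samples = [{"id": 1, "text": "first"}, {"id": 2, "text": "second\nline"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    for record in samples:
        append_record(path, record)
    restored = read_records(path)
```

Because the file is opened in append mode, running the storing step twice writes every record twice, so it may be worth deleting output.jsonl before re-running the notebook.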
        

### Mine the Tweets

This function mines the tweets of the given username/Twitter handle and stores them in the JSON Lines file (output.jsonl)

In [29]:
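def mineTweets(username):
    # A single user_timeline call returns only the most recent page of tweets,
    # so page through the timeline with Cursor (200 is the maximum page size)
    tweets = tweepy.Cursor(api.user_timeline, screen_name=username, count=200).items()
    for tweet in tweets:
        storeTweets(tweet._json)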

In [30]:
mineTweets('midasIIITD')

### Sneak peek at the JSON Lines file
Reading the output.jsonl file back into a list

In [37]:
data = []
with jsonlines.open('output.jsonl') as reader:
    for obj in reader:
        data.append(obj)


### Calculating the number of tweets

Cursor pages through the whole timeline, so the length of the collected list is the total number of tweets returned by the API.
(Other methods could also be used, but this is the simplest.)

In [38]:
iterator = tweepy.Cursor(api.user_timeline,
                         screen_name='@midasIIITD').items()
data = [status._json for status in iterator]

print(f"Number of Tweets: {len(data)}\n")

Number of Tweets: 344
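
Cursor hides the paging. As far as I understand it, for a timeline it keeps asking for older pages by lowering max_id until an empty page comes back. The sketch below imitates that idea without any network call: `fetch_page` is a hypothetical stand-in for api.user_timeline over a pretend account with 10 tweets, and it is not how tweepy is actually implemented.

```python
def fetch_page(max_id=None, count=3):
    """Hypothetical stand-in for api.user_timeline: newest first, ids <= max_id."""
    all_ids = list(range(10, 0, -1))  # pretend the account has 10 tweets
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:count]]


def fetch_all(fetch, count=3):
    """Keep requesting older pages until an empty page comes back."""
    collected = []
    max_id = None
    while True:
        page = fetch(max_id=max_id, count=count)
        if not page:
            break
        collected.extend(page)
        # The next page must be strictly older than the oldest tweet seen so far.
        max_id = page[-1]["id"] - 1
    return collected


tweets = fetch_all(fetch_page)
print(f"Number of Tweets: {len(tweets)}")
```

Note that the standard user timeline endpoint only reaches back roughly 3,200 tweets, so for very active accounts the count obtained this way is a lower bound.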



### The most important part: collecting the required information

Build a list (info) that holds, for every tweet, the text, the date and time of the tweet, the number of likes, the number of retweets, and the number of images present.

- If the tweet is a retweet, use the original tweet (the 'retweeted_status' object) as the data point.

- To extract and count the images/photos, I have used the 'extended_entities' key for media, as given in the Twitter developer documentation:
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object

- From the output.jsonl file, we see that the text, time/date of the tweet, number of likes, and number of retweets can be read from the ['text'], ['created_at'], ['favorite_count'], and ['retweet_count'] keys.

- The columns of the table are named accordingly.

In [39]:
info = []
for datatweet in data:
    try:
        datatweet = datatweet['retweeted_status']
    except KeyError:
        pass
    
    try:
        # count the media items of type 'photo'; a count of 0 is reported as None
        image_count = sum(m['type'] == 'photo' for m in datatweet['extended_entities']['media']) or None
    except KeyError:
        image_count = None
    
    # create the entry with the required information
    entry = {
        'Text': datatweet['text'], 
        'Date and Time of Tweet': datatweet['created_at'], 
        'No. of favorites/likes': datatweet['favorite_count'], 
        'No. of retweets': datatweet['retweet_count'], 
        'No. of images present': image_count 
    }
    
    # save the entry
    info.append(entry)

print(f"Number of Entries: {len(info)}\n")

Number of Entries: 344
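
The extraction logic can also be checked offline on hand-made dictionaries, which is handy when the API keys are not available. This is a sketch under stated assumptions: the helper names `count_photos` and `summarise` are hypothetical, and the three payloads are made up and contain only the keys used here, not the full tweet object.

```python
def count_photos(tweet):
    """Return the number of photos attached to a tweet dict, or None if there are none."""
    media = tweet.get("extended_entities", {}).get("media", [])
    photos = sum(1 for item in media if item.get("type") == "photo")
    return photos or None


def summarise(tweet):
    """Build one table row; for a retweet, describe the original tweet instead."""
    original = tweet.get("retweeted_status", tweet)
    return {
        "Text": original["text"],
        "Date and Time of Tweet": original["created_at"],
        "No. of favorites/likes": original["favorite_count"],
        "No. of retweets": original["retweet_count"],
        "No. of images present": count_photos(original),
    }


# Made-up payloads: a plain tweet, a tweet with media, and a retweet of the latter.
plain = {"text": "hello", "created_at": "Wed Apr 10 09:01:29 +0000 2019",
         "favorite_count": 0, "retweet_count": 0}
with_photos = dict(plain, text="pics", extended_entities={
    "media": [{"type": "photo"}, {"type": "photo"}, {"type": "video"}]})
retweet = dict(plain, text="RT pics", retweeted_status=with_photos)

rows = [summarise(t) for t in (plain, with_photos, retweet)]
```

Using dict.get with a default avoids the try/except blocks, at the cost of silently accepting payloads where 'extended_entities' is missing for some other reason.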



### Displaying the information in the table using DataFrame by Pandas


In [40]:
df = pd.DataFrame.from_dict(info, orient='columns')
df

| | Date and Time of Tweet | No. of favorites/likes | No. of images present | No. of retweets | Text |
|---|---|---|---|---|---|
| 0 | Wed Apr 10 09:01:29 +0000 2019 | 0 | | 0 | Clarification: Our earlier post which indicate... |
| 1 | Wed Apr 10 04:28:24 +0000 2019 | 3 | | 1 | Applications open for MTech (CB) through JNU C... |
| 2 | Tue Apr 09 09:03:12 +0000 2019 | 60 | | 14 | We are delighted to share that IIIT-Delhi is r... |
| 3 | Mon Apr 08 20:10:01 +0000 2019 | 99 | | 36 | Professor Jelani Nelson founded AddisCoder, a ... |
| 4 | Mon Apr 08 17:35:00 +0000 2019 | 38 | | 17 | For anyone interested in submitting to EMNLP 2... |
| 5 | Mon Mar 18 06:40:38 +0000 2019 | 20 | | 15 | Announcing the 2019 MediaEval multimedia tasks... |
| 6 | Mon Apr 08 07:08:12 +0000 2019 | 18 | | 2 | Many Congratulations to @midasIIITD student, S... |
| 7 | Mon Apr 08 03:27:42 +0000 2019 | 5 | | 0 | @midasIIITD thanks all students who have appea... |
| 8 | Sun Apr 07 14:17:29 +0000 2019 | 0 | | 0 | @himanchalchandr Meanwhile, complete CV/NLP ta... |
| 9 | Sun Apr 07 14:17:09 +0000 2019 | 0 | | 0 | @sayangdipto123 Submit as per the guideline ag... |