# Mining (together with a bit of web scraping) of large social networks from Twitter using Python (and Ruby)
## By Moses Boudourides and Sergios Lenis 
## University of Patras, Greece

<p> </p>

<p> </p>

<p> </p>

**Table of Contents**

[I. Prerequisite Python Modules and Scripts](#I)

[II. Twitter Mining from the Twitter API](#II)

[III. Web Scraping from the Twitter Advanced Search](#III)

[IV. Visualization of Timeseries, Geolocations & Network Analysis of Twitter Data](#IV)

<p> </p>

<p> </p>

<p> </p>

<a id='I'></a>
## I. Prerequisite Python Modules and Scripts

### **The following cell imports the prerequisite Python modules for this notebook to run**

In [1]:
import pandas as pd
import json 
import os
import imp

### **First, one has to download the *GitHub* repository *https://github.com/mboudour/TwitterMining*, which includes everything needed for this notebook to run.**

### **GitHub Blog: *http://mboudour.github.io/***

### **Furthermore, all the modules imported in the script *collect_tweets_notebook.py* must already be installed. Some of these modules can be installed and imported from within the notebook as follows (after removing the leading #):**

In [3]:
# !pip install python-twitter
# import twitter

Collecting python-twitter
Collecting future (from python-twitter)
Collecting requests-oauthlib (from python-twitter)
  Downloading requests_oauthlib-0.6.2-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib->python-twitter)
Installing collected packages: future, oauthlib, requests-oauthlib, python-twitter
Successfully installed future-0.15.2 oauthlib-1.1.2 python-twitter-3.1 requests-oauthlib-0.6.2


<a id='II'></a>
## II. Twitter Mining from the Twitter API (https://apps.twitter.com/)

### **Setting Input and Output Directories**

In [3]:
input_dir='/home/mosesboudourides/Dropbox/Python Projects/DublinNovemer2016'

output_dir='/home/mosesboudourides/Dropbox/Python Projects/DublinNovemer2016/Out_json'

# cred_dic=None # Run this only the first time!
cred_dic='/home/mosesboudourides/Dropbox/Python Projects/DublinNovemer2016/credentials/auth_cred.txt'

pp= !pwd # for Mac OS X or Unix
# use 'pp = !cd' in Windows

os.chdir(input_dir)
from test_class_tpa import create_df
import collect_tweets_notebook as ctn

os.chdir(pp[0])

def create_beaker_com_dict(sps):
    nsps={}
    for k,v in sps.items():
        nsps[k]=[]
        if k=='date_split':
            for kk in sorted(v.keys()):
                nsps[k].append(v[kk].strftime('%Y%m%d'))
        else:
            for kk in sorted(v.keys()):
                nsps[k].append(v[kk])

    return nsps
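
The helper above turns a dictionary of dictionaries into a dictionary of lists ordered by the inner keys, formatting the values under `'date_split'` as `YYYYMMDD` strings. A minimal sketch of its behaviour follows. The function is repeated (with the same logic) only so that the example runs on its own, and the input, including the key `'tweet_count'`, is a made-up illustration rather than data produced by the notebook.

```python
import datetime

def create_beaker_com_dict(sps):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    nsps = {}
    for k, v in sps.items():
        nsps[k] = []
        for kk in sorted(v.keys()):
            if k == 'date_split':
                # Dates are converted to compact YYYYMMDD strings.
                nsps[k].append(v[kk].strftime('%Y%m%d'))
            else:
                nsps[k].append(v[kk])
    return nsps

# Hypothetical input: inner keys are positions, deliberately unsorted in one case.
sample = {
    'date_split': {0: datetime.datetime(2016, 11, 1),
                   1: datetime.datetime(2016, 11, 2)},
    'tweet_count': {1: 7, 0: 12},
}

result = create_beaker_com_dict(sample)
print(result['date_split'])   # ['20161101', '20161102']
print(result['tweet_count'])  # [12, 7]
```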

### **Authentication and Login to the Twitter API**

In [4]:
vv=ctn.UserAuth(auth_file=cred_dic)

### **Once the authentication tokens are known, one has to insert them by uncommenting and running the following three cells:**

In [5]:
vv.login()

In [9]:
vv.check_login()

In [10]:
twi_api=vv.get_auth()

### **Setting up a Search**
#### **Further info about how to build a Twitter query is available at: https://dev.twitter.com/rest/public/search.**

In [11]:
search_term='Charlie Flanagan'
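
A search term can also combine several of the standard Twitter search operators (exact phrase in double quotes, `#hashtag`, `from:user`, `since:` and `until:` dates). The sketch below shows one way such a string might be assembled. The helper `build_query` is hypothetical and not part of *collect_tweets_notebook.py*, the hashtag is a made-up example, and it is an assumption that `ctn.TwitterSearch` passes the string to the API unchanged.

```python
def build_query(phrase=None, hashtags=(), from_user=None, since=None, until=None):
    """Assemble a query string from standard Twitter search operators."""
    parts = []
    if phrase:
        # Double quotes request an exact-phrase match.
        parts.append('"%s"' % phrase)
    # Accept hashtags with or without the leading '#'.
    parts.extend('#' + h.lstrip('#') for h in hashtags)
    if from_user:
        parts.append('from:' + from_user.lstrip('@'))
    if since:
        parts.append('since:' + since)   # expected as YYYY-MM-DD
    if until:
        parts.append('until:' + until)   # expected as YYYY-MM-DD
    return ' '.join(parts)

# Hypothetical example values.
example_query = build_query(phrase='Charlie Flanagan',
                            hashtags=['Dublin'],
                            since='2016-11-01')
print(example_query)  # "Charlie Flanagan" #Dublin since:2016-11-01
```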

In [12]:
sea=ctn.TwitterSearch(twi_api,search_text=search_term,working_path=output_dir,out_file_dir=None,
max_pages=10,results_per_page=100,sin_id=None,max_id=None,verbose=True)

In [14]:
sea.streamsearch()

### **<font color='red'>To interrupt the collection of tweets initiated above, one has to click "Kernel > Interrupt" from the Notebook menu.</font>**

### **The data collected by the above search are saved as a json file, named after the search_term defined above, in the output_dir defined above.**
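
To inspect the saved file outside the provided scripts, it can be read back with the standard `json` module. The sketch below is only an assumption about the file layout: it accepts either a single JSON array or one JSON object per line, since the notebook does not state which of the two is written. The helper `load_tweets` and the demo file are hypothetical.

```python
import json
import os
import tempfile

def load_tweets(path):
    """Return a list of tweet dicts from a JSON-array file or a one-object-per-line file."""
    with open(path) as f:
        content = f.read().strip()
    if not content:
        return []
    try:
        data = json.loads(content)
        return data if isinstance(data, list) else [data]
    except ValueError:
        # Not a single JSON document: fall back to one JSON object per line.
        return [json.loads(line) for line in content.splitlines() if line.strip()]

# Demo on a temporary, made-up file with two line-delimited records.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json')
with open(demo_path, 'w') as f:
    f.write('{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n')

tweets = load_tweets(demo_path)
print(len(tweets))  # 2
```

In the notebook, the path would be built from `output_dir` and `search_term`; the exact file name and extension should be checked in the output directory.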

### **In the json file, there are four main “objects” provided by the API:** 
* **Tweets,** 
* **Users,** 
* **Entities and** 
* **Places.**

### **Definitions and info about all these objects are given in https://dev.twitter.com/overview/api.**

### For our purposes, we select 21 fields of practical interest from these "objects" and eventually create a Pandas data frame with them as columns.

#### Twitter json "objects" https://dev.twitter.com/rest/reference/get/search/tweets

In [13]:
columnss=['id','user_id','username','created_at','language','hashtag_count','retweet_count','mention_count',
          'statuses_count','followers_count','friends_count','listed_count','videos_count','photos_count',
          'undef_count','coordinates','bounding','place','hashtags','mentions','text'] 
for i in columnss:
    print(i)

id
user_id
username
created_at
language
hashtag_count
retweet_count
mention_count
statuses_count
followers_count
friends_count
listed_count
videos_count
photos_count
undef_count
coordinates
bounding
place
hashtags
mentions
text
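
As an illustration of how such columns relate to the API "objects", the sketch below flattens one tweet dictionary, laid out as in the Twitter search API response, into a subset of the columns listed above. The mapping is an assumption and is not taken from the authors' `create_df`. The columns `videos_count`, `photos_count`, `undef_count` and `bounding` are omitted because their definitions are not given here, and the sample tweet is made up.

```python
def flatten_tweet(tweet):
    """Map one API-style tweet dict onto a subset of the columns listed above."""
    user = tweet.get('user') or {}
    entities = tweet.get('entities') or {}
    place = tweet.get('place') or {}          # 'place' is null for most tweets
    coords = tweet.get('coordinates') or {}   # so is 'coordinates'
    hashtags = [h['text'] for h in entities.get('hashtags', [])]
    mentions = [m['screen_name'] for m in entities.get('user_mentions', [])]
    return {
        'id': tweet.get('id'),
        'user_id': user.get('id'),
        'username': user.get('screen_name'),
        'created_at': tweet.get('created_at'),
        'language': tweet.get('lang'),
        'hashtag_count': len(hashtags),
        'retweet_count': tweet.get('retweet_count'),
        'mention_count': len(mentions),
        'statuses_count': user.get('statuses_count'),
        'followers_count': user.get('followers_count'),
        'friends_count': user.get('friends_count'),
        'listed_count': user.get('listed_count'),
        'coordinates': coords.get('coordinates'),
        'place': place.get('full_name'),
        'hashtags': hashtags,
        'mentions': mentions,
        'text': tweet.get('text'),
    }

# Made-up tweet with only the fields used above.
sample_tweet = {
    'id': 1,
    'created_at': 'Tue Nov 01 10:00:00 +0000 2016',
    'lang': 'en',
    'text': 'Hello @another_user #example',
    'retweet_count': 3,
    'coordinates': None,
    'place': None,
    'user': {'id': 42, 'screen_name': 'example_user', 'statuses_count': 100,
             'followers_count': 10, 'friends_count': 20, 'listed_count': 1},
    'entities': {'hashtags': [{'text': 'example'}],
                 'user_mentions': [{'screen_name': 'another_user'}]},
}

row = flatten_tweet(sample_tweet)
print(row['username'])
print(row['hashtags'])
```

A list of such rows can be passed to `pd.DataFrame` to obtain a data frame with these keys as columns.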


<a id='III'></a>
## III. Web Scraping from the Twitter Advanced Search

### Since this is implemented in *Ruby*, that language must already be installed on the computer where one intends to run the following procedure.

### **The Twitter advanced search page is located at *https://twitter.com/search-advanced*.**

### **For Twitter Scraping, one should run the following two Ruby scripts:**
1. scraping_script.rb
2. getting_json_script.rb

### **If the searched term is a hashtag (or multiple hashtags), one should insert it (them) directly in line 22 after 'searchterm=' in the script *scraping_script.rb*. One should also insert the dates *from* and *to* in lines 23 and 26 of this script. Moreover, line 29 should contain "hashtag=true".**

### **For non-hashtag-type search terms, one has to open the page with the outcome of the Twitter search, copy the substring of the URL that lies between 'search?q=' and '&src=', and paste it in line 22 after 'searchterm=' in the script *scraping_script.rb*. Again, one should also insert the dates *from* and *to* in lines 23 and 26 of this script. In this case, line 29 should contain "hashtag=false".**
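
The substring to copy can also be cut out of the URL programmatically. The following is a minimal sketch in Python. The helper `extract_search_term` is hypothetical and the URL is an illustrative example, not one taken from an actual search. The result is left URL-encoded, exactly as it appears in the address bar.

```python
def extract_search_term(url):
    """Return the part of a Twitter search URL between 'search?q=' and '&src='."""
    marker = 'search?q='
    start = url.index(marker) + len(marker)   # raises ValueError if the marker is absent
    end = url.find('&src=', start)
    # If '&src=' is missing, keep everything after the marker.
    return url[start:] if end == -1 else url[start:end]

# Illustrative URL (assumed shape of a search results address).
example_url = 'https://twitter.com/search?q=Charlie%20Flanagan&src=typd'
print(extract_search_term(example_url))  # Charlie%20Flanagan
```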

### **Subsequently, one should run the script *getting_json_script.rb* to generate the json file of the scraped tweets.**

<a id='IV'></a>
## IV. Visualization of Timeseries, Geolocations & Network Analysis of Twitter Data

### **This is done using (the template of) the *Beaker Notebook*:**
* beaker_analysis_base_Viz.bkr