# Mining (together with a bit of web scraping) of large social networks from Twitter using Python (and Ruby)
## By Moses Boudourides and Sergios Lenis 
## University of Patras, Greece


**Table of Contents**

[I. Prerequisite Python Modules and Scripts](#I)

[II. Twitter Mining from the Twitter API](#II)

[III. Web Scraping from the Twitter Advanced Search](#III)

[IV. Statistical & Network Analyses](#IV)


<a id='I'></a>
## I. Prerequisite Python Modules and Scripts

### **The following cell imports the prerequisite Python modules for this notebook to run**

In [1]:
import pandas as pd
import json 
import os
import imp

### **First, one has to download the *github* directory *https://github.com/mboudour/TwitterMining*, where everything needed for this notebook to run is included.**

### **Github Blog: *http://mboudour.github.io/***

### **Furthermore, one needs to have already installed all the modules imported in the script** *collect_tweets_notebook.py*. Some of these modules can be installed and imported from the notebook as follows (without #):

In [2]:
# !pip install python-twitter
# import twitter
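
Before running the rest of the notebook, it can help to check which of the required modules are already installed. The following is a minimal sketch for Python 3, not part of the repository: the helper name `missing_modules` and the list `required` are assumptions chosen for illustration, and the list should be extended with whatever *collect_tweets_notebook.py* actually imports.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: the modules imported above plus python-twitter,
# which is imported under the name `twitter`.
required = ['json', 'os', 'pandas', 'twitter']
print(missing_modules(required))
```

Any name printed by this cell still has to be installed, for example with `!pip install` as in the cell above.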

<a id='II'></a>
## II. Twitter Mining from the Twitter API (https://apps.twitter.com/)

### **Setting Input and Output Directories**

In [2]:
# input_dir='/Users/mosesboudourides/GithubRepositories/TwitterMining'
input_dir='/home/mab/github_repos/TwitterMining'

# output_dir='/Users/mosesboudourides/twitTemp'
output_dir='/home/mab/Desktop/twitTemp'

# cred_dic=None
cred_dic='/home/mab/Desktop/twitTemp/credentials/auth_cred.txt'
# cred_dic='/Users/mosesboudourides/twitTemp/credentials/auth_cred.txt'
# cred_dic='/home/mab/Dropbox/Python Projects/EUSN2016_TwitterWorkshop/TwitterMining/credentials/auth_cred.txt'
# cred_dic='/media/sergios-len/Elements/Brighton_workshop/auth_cred.txt'

pp= !pwd
os.chdir(input_dir)
from test_class_tpa import create_df
import collect_tweets_notebook as ctn

os.chdir(pp[0])

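def create_beaker_com_dict(sps):
    """Convert a dict of index-keyed dicts into a dict of lists ordered by key.

    Values under 'date_split' are formatted as YYYYMMDD strings.
    """
    nsps={}
    for k,v in sps.items():
        nsps[k]=[]
        if k=='date_split':
            for kk in sorted(v.keys()):
                nsps[k].append(v[kk].strftime('%Y%m%d'))
        else:
            for kk in sorted(v.keys()):
                nsps[k].append(v[kk])

    return nsps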

I am being imported from another module
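
To make the behaviour of `create_beaker_com_dict` concrete, here is a small usage sketch. The function body is repeated in condensed form so that the snippet runs on its own, and the input values are made up for illustration only.

```python
from datetime import datetime


def create_beaker_com_dict(sps):
    # Same logic as the helper defined in the cell above, condensed.
    nsps = {}
    for k, v in sps.items():
        if k == 'date_split':
            nsps[k] = [v[kk].strftime('%Y%m%d') for kk in sorted(v)]
        else:
            nsps[k] = [v[kk] for kk in sorted(v)]
    return nsps


# Illustrative input (made-up values): each entry maps an index to a value.
sample = {
    'date_split': {1: datetime(2016, 6, 27), 0: datetime(2016, 6, 22)},
    'counts': {1: 18, 0: 57},
}
result = create_beaker_com_dict(sample)
print(result['date_split'])  # ['20160622', '20160627']
print(result['counts'])      # [57, 18]
```

Each inner dictionary becomes a list ordered by its keys, and dates are rendered as compact strings.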


### **Authentication and login in Twitter API**

In [3]:
vv=ctn.UserAuth(auth_file=cred_dic)

### **After the authentication tokens are known, one has to insert them below by uncommenting and running the following three cells:**

In [4]:
vv.login()

In [5]:
vv.check_login()

{"created_at": "Mon Dec 29 11:20:53 +0000 2008", "description": "\u30e2\u30fc\u30bc\u30ba", "favourites_count": 1031, "followers_count": 1174, "friends_count": 420, "geo_enabled": true, "id": 18447918, "lang": "en", "listed_count": 144, "location": "Patras, Greece", "name": "Moses Boudourides", "profile_background_color": "9C584B", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/150481468/wordpainting.jpg", "profile_background_tile": true, "profile_banner_url": "https://pbs.twimg.com/profile_banners/18447918/1377851188", "profile_image_url": "http://pbs.twimg.com/profile_images/378800000721710479/c093b7142774b1c8a07b48a8edca8d37_normal.png", "profile_link_color": "FF0D00", "profile_sidebar_fill_color": "FFF7CC", "profile_text_color": "0C3E53", "screen_name": "mosabou", "status": {"created_at": "Thu Jun 30 04:43:59 +0000 2016", "hashtags": [], "id": 748376514710863872, "id_str": "748376514710863872", "lang": "en", "media": [{"display_url": "pic.twitter.co

In [6]:
twi_api=vv.get_auth()

### **Setting up a Search**
#### **Further info about how to build a Twitter query is available at: https://dev.twitter.com/rest/public/search.**

In [7]:
search_term='@MediaGovGr'

In [8]:
sea=ctn.TwitterSearch(twi_api,search_text=search_term,working_path=output_dir,out_file_dir=None,
max_pages=10,results_per_page=100,sin_id=None,max_id=None,verbose=True)
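
The parameters `max_pages`, `results_per_page` and `max_id` suggest that results are fetched page by page. The sketch below shows one common way of paging through Twitter search results, in which each request asks only for tweets older than the smallest id seen so far. Whether `ctn.TwitterSearch` works this way is an assumption; `paginate` and `fake_fetch` are hypothetical names, and no real API call is made.

```python
def paginate(fetch_page, max_pages=10, results_per_page=100, max_id=None):
    """Collect up to max_pages pages, walking backwards in time via max_id."""
    collected = []
    for _ in range(max_pages):
        page = fetch_page(count=results_per_page, max_id=max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Next request: only tweets strictly older than the oldest one seen.
        max_id = min(t['id'] for t in page) - 1
    return collected


def fake_fetch(count, max_id=None):
    """Stand-in for an API call: serves fake tweets with ids 250..1, newest first."""
    top = 250 if max_id is None else min(max_id, 250)
    return [{'id': i} for i in range(top, max(top - count, 0), -1)]


tweets = paginate(fake_fetch, max_pages=10, results_per_page=100)
print(len(tweets))
```

Under this assumption, `max_pages=10` with `results_per_page=100` would cap a single run at 1000 tweets.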

In [9]:
sea.streamsearch()

216 tweets preloaded from /home/mab/Desktop/twitTemp/Output/MediaGovGr.ids
18 new tweets logged at /home/mab/Desktop/twitTemp/Output/MediaGovGr.json
1 100 aa None Mon Jun 27 09:18:50 +0000 2016
57 new tweets logged at /home/mab/Desktop/twitTemp/Output/MediaGovGr.json
3 92 aa 745878183619223552 Wed Jun 22 06:27:59 +0000 2016


TwitterError: [{u'message': u'Rate limit exceeded', u'code': 88}]
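
The `Rate limit exceeded` error above means that the API refused further requests for the time being. One way to cope is to wait and retry with increasing delays. The following is a minimal sketch of that idea, not the behaviour of `collect_tweets_notebook.py`: `RateLimitError`, `call_with_backoff` and `flaky_search` are hypothetical names, and the delays are recorded in a list instead of actually sleeping.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the library's rate-limit exception."""


def call_with_backoff(func, retry_on=RateLimitError, max_tries=4,
                      base_delay=60, sleep=time.sleep):
    """Call func(), doubling the wait after each rate-limit failure."""
    for attempt in range(max_tries):
        try:
            return func()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake search that fails twice before succeeding.
calls = {'n': 0}


def flaky_search():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RateLimitError('Rate limit exceeded')
    return 'ok'


waits = []  # collect the requested delays instead of really sleeping
outcome = call_with_backoff(flaky_search, base_delay=60, sleep=waits.append)
print(outcome, waits)  # ok [60, 120]
```

In a real run one would pass the actual search call as `func` and the library's own exception class as `retry_on`.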

### **<font color='red'>To interrupt the collection of tweets initiated above, one has to click "Kernel > Interrupt" from the Notebook menu.</font>**

### **The data collected by the above search are saved as a json file in the output_dir defined above; the file is named after the search_term (here, MediaGovGr.json).**

### **In the json file, there are four main “objects” provided by the API:** 
* **Tweets,** 
* **Users,** 
* **Entities and** 
* **Places.**

### **Definitions and info about all these objects are given at https://dev.twitter.com/overview/api.**
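
As an illustration of how these four objects nest inside a single tweet, the sketch below parses one made-up tweet and pulls out a field from each object. The field names follow the public Twitter REST API; the sample values and the helper `summarize_tweet` are assumptions for illustration, and how the collected json file is laid out on disk is not shown here.

```python
import json


def summarize_tweet(tweet):
    """Pick one or two fields from each of the four main objects."""
    user = tweet.get('user') or {}
    entities = tweet.get('entities') or {}
    place = tweet.get('place') or {}
    return {
        'tweet_id': tweet.get('id'),                  # Tweet
        'text': tweet.get('text'),                    # Tweet
        'user': user.get('screen_name'),              # User
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],  # Entities
        'place': place.get('full_name'),              # Place
    }


# Made-up tweet, shaped like an API response, serialized as a JSON string.
raw = json.dumps({
    'id': 1,
    'text': 'Hello #Patras',
    'user': {'id': 10, 'screen_name': 'example_user'},
    'entities': {'hashtags': [{'text': 'Patras'}], 'user_mentions': []},
    'place': {'full_name': 'Patras, Greece'},
})
summary = summarize_tweet(json.loads(raw))
print(summary)
```

Tweets without a place carry `null` there, which is why the helper falls back to an empty dictionary.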

## Selecting 21 practically interesting "objects" and creating a Pandas data frame with them as columns.

In [13]:
columnss=['id','user_id','username','created_at','language','hashtag_count','retweet_count','mention_count',
          'statuses_count','followers_count','friends_count','listed_count','videos_count','photos_count',
          'undef_count','coordinates','bounding','place','hashtags','mentions','text'] 
for i in columnss:
    print(i)

id
user_id
username
created_at
language
hashtag_count
retweet_count
mention_count
statuses_count
followers_count
friends_count
listed_count
videos_count
photos_count
undef_count
coordinates
bounding
place
hashtags
mentions
text
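
The repository builds this data frame with its own code (`create_df`), which is not reproduced here. As a rough sketch of the idea, the snippet below flattens made-up tweets into rows for a subset of the columns listed above. The helper `tweet_to_row`, the sample tweets and the way each column is derived from the API fields are all assumptions for illustration.

```python
import pandas as pd


def tweet_to_row(tweet):
    """Flatten one tweet dictionary into a row for a subset of the columns."""
    user = tweet.get('user') or {}
    entities = tweet.get('entities') or {}
    hashtags = [h['text'] for h in entities.get('hashtags', [])]
    mentions = [m['screen_name'] for m in entities.get('user_mentions', [])]
    return {
        'id': tweet.get('id'),
        'user_id': user.get('id'),
        'username': user.get('screen_name'),
        'created_at': tweet.get('created_at'),
        'language': tweet.get('lang'),
        'hashtag_count': len(hashtags),
        'retweet_count': tweet.get('retweet_count'),
        'mention_count': len(mentions),
        'followers_count': user.get('followers_count'),
        'friends_count': user.get('friends_count'),
        'hashtags': hashtags,
        'mentions': mentions,
        'text': tweet.get('text'),
    }


# Made-up tweets shaped like API responses.
sample_tweets = [
    {'id': 1, 'created_at': 'Mon Jun 27 09:18:50 +0000 2016', 'lang': 'en',
     'retweet_count': 2, 'text': 'Hi @MediaGovGr #news',
     'user': {'id': 10, 'screen_name': 'alice',
              'followers_count': 5, 'friends_count': 7},
     'entities': {'hashtags': [{'text': 'news'}],
                  'user_mentions': [{'screen_name': 'MediaGovGr'}]}},
    {'id': 2, 'created_at': 'Wed Jun 22 06:27:59 +0000 2016', 'lang': 'el',
     'retweet_count': 0, 'text': 'No entities here',
     'user': {'id': 11, 'screen_name': 'bob',
              'followers_count': 1, 'friends_count': 3},
     'entities': {'hashtags': [], 'user_mentions': []}},
]

columns = ['id', 'user_id', 'username', 'created_at', 'language',
           'hashtag_count', 'retweet_count', 'mention_count',
           'followers_count', 'friends_count', 'hashtags', 'mentions', 'text']
df = pd.DataFrame([tweet_to_row(t) for t in sample_tweets], columns=columns)
print(df[['username', 'hashtag_count', 'mention_count']])
```

Passing `columns` explicitly fixes the column order, so the frame looks the same regardless of the order of keys in each row.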


<a id='III'></a>
## III. Web Scraping from the Twitter Advanced Search

### **For Twitter Scraping, we are using the *Beaker Notebook* https://pub.beakernotebook.com/publications/ee134c26-2b23-11e6-abb8-6fa10fd07640?fullscreen=true.**

### **Again, one needs to have already installed all the modules imported in this *Beaker Notebook* as well as in all the scripts of the *github* directory *https://github.com/mboudour/TwitterMining*.**

### **First, one should start with an advanced search at *https://twitter.com/search-advanced*.**

### **If the searched term is a hashtag (or multiple hashtags), one should continue with the *Beaker Notebook* *https://pub.beakernotebook.com/publications/ee134c26-2b23-11e6-abb8-6fa10fd07640?fullscreen=true*.**

### **Otherwise (for non-hashtag-type search terms), one has to open the page with the outcome of the Twitter search, copy the substring of the URL that follows 'search?q=' and precedes '&src=', and paste it in the second cell of the *Twitter Scraping in Ruby Beaker Notebook* after 'searchterm='. For a search on day night paris, for example, it suffices to copy the string 'day%20night%20paris'.**
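
Copying the substring by hand is error-prone, so the same step can be sketched in Python 3. The helper `extract_search_term` is a hypothetical name, and the `&src=typd` suffix in the sample URL is an assumed example of what may follow the query. The function returns the part after `q=` exactly as it appears in the URL, still percent-encoded, which is the form to paste after 'searchterm='.

```python
from urllib.parse import unquote, urlsplit


def extract_search_term(url):
    """Return the raw (still percent-encoded) value of the q parameter, or None."""
    query = urlsplit(url).query
    for part in query.split('&'):
        if part.startswith('q='):
            return part[len('q='):]
    return None


# Sample URL; the '&src=typd' suffix is an assumption for illustration.
url = 'https://twitter.com/search?q=day%20night%20paris&src=typd'
term = extract_search_term(url)
readable = unquote(term)
print(term)      # day%20night%20paris
print(readable)  # day night paris
```

Decoding with `unquote` is only a readability check; the encoded string is the one to reuse.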

<a id='IV'></a>
## IV. Statistical & Network Analyses

## **An example done in the *Beaker Notebook* https://pub.beakernotebook.com/publications/3a62f03e-27e8-11e6-9ac4-6732ff96645f?fullscreen=true**