# Online Data Collection
- Shared
 - API based
 - Files and databases
- Web Scraping
- Manual
 - Surveys
 - Observation (When desperate)
 

# Online Data Collection Cont.

- Data is likely to not be organized as rows and columns
- You have to organize the data manually and construct a DataFrame
- Need to determine what your variables will be
- Need to be very careful with level of analysis here
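
To make the point about level of analysis concrete, here is a minimal sketch. The raw data is made up for illustration (hypothetical accounts with nested posts). The same collected data yields two different DataFrames depending on what one row is chosen to represent.

```python
import pandas as pd

# Hypothetical raw data: two accounts, each with a nested list of posts
raw = [
    {"account": "a1", "posts": [{"likes": 3}, {"likes": 5}]},
    {"account": "a2", "posts": [{"likes": 10}]},
]

# Post level: one row per post
post_df = pd.DataFrame(
    [
        {"account": acc["account"], "likes": post["likes"]}
        for acc in raw
        for post in acc["posts"]
    ]
)

# Account level: one row per account, with posts summarized
account_df = pd.DataFrame(
    [
        {
            "account": acc["account"],
            "n_posts": len(acc["posts"]),
            "total_likes": sum(p["likes"] for p in acc["posts"]),
        }
        for acc in raw
    ]
)
```

The variables differ between the two: `likes` only makes sense per post, while `n_posts` and `total_likes` only make sense per account.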

# Constructing Data Frames Manually

## Option 1: A List of Dictionaries

In [36]:
import pandas as pd

record1 = {"name":"mohammed", "id":1234, "age":45}
record2 = {"name":"Ali", "id":1235, "age":35}
record3 = {"name":"Sara", "id":1236, "age":25}

list_of_records = [record1, record2, record3]

In [2]:
# Here is another way to write the previous code

list_of_records = [
    {"name":"mohammed", "id":1234, "age":45},
    {"name":"Ali", "id":1235, "age":35},
    {"name":"Sara", "id":1236, "age":25},
]

In [3]:
# To create a DataFrame you construct an object

df = pd.DataFrame(list_of_records)
df

Unnamed: 0,age,id,name
0,45,1234,mohammed
1,35,1235,Ali
2,25,1236,Sara


# Important Notes
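- All dictionaries should have the same keys; a key missing from a record leaves an empty (NaN) cell
- The keys will represent the column names
- The values of the dictionary will represent the record values
- Order of columns depends on the pandas version: older versions sort them alphabetically (as in the output above), recent versions keep the order of the keys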
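
A minimal sketch of these notes, assuming a record with a deliberately missing key: pandas fills the gap with NaN, and the `columns=` argument of the `DataFrame` constructor can be used to pin the column order explicitly.

```python
import pandas as pd

records = [
    {"name": "mohammed", "id": 1234, "age": 45},
    {"name": "Ali", "id": 1235},  # "age" is missing on purpose
]

# columns= fixes the column order regardless of the pandas version
df_fixed = pd.DataFrame(records, columns=["name", "id", "age"])
```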

## Option 2: List of Lists

In [4]:
record1 = ["mohammed", 1234, 45]
record2 = ["Ali", 1235, 35]
record3 = ["Sara", 1236, 25]

list_of_records = [record1, record2, record3]

In [20]:
# Alternative way of constructing the list of records
list_of_records = [
    ["mohammed", 1234, 45],
    ["Ali", 1235, 35],
    ["Sara", 1236, 25],  
]

In [21]:
# To create a DataFrame you construct an object

df = pd.DataFrame(list_of_records)
df

Unnamed: 0,0,1,2
0,mohammed,1234,45
1,Ali,1235,35
2,Sara,1236,25


# Important Notes
- All lists must have the same number of items
- Order of columns will be maintained, first item will be part of first column, and so on
- Columns will be numbered, but you can rename them

In [22]:
df.columns = ["name", "id", "age"]
df

Unnamed: 0,name,id,age
0,mohammed,1234,45
1,Ali,1235,35
2,Sara,1236,25
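
As an alternative to renaming afterwards, the column names can be given when the DataFrame is built. A short sketch using the same records:

```python
import pandas as pd

list_of_records = [
    ["mohammed", 1234, 45],
    ["Ali", 1235, 35],
    ["Sara", 1236, 25],
]

# Name the columns at construction time instead of renaming later
df_named = pd.DataFrame(list_of_records, columns=["name", "id", "age"])
```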


## Adding Rows

Assuming you are using the default index (0, 1, 2, ...), `len(df)` is the next free row label, so you can add a row this way

In [23]:
df.loc[len(df)] = ["zaid", 1237, 33]
df

Unnamed: 0,name,id,age
0,mohammed,1234,45
1,Ali,1235,35
2,Sara,1236,25
3,zaid,1237,33
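
When several rows arrive at once, it may be simpler to put them in their own DataFrame and combine the two with `pd.concat`. This is a sketch, not the only way to do it; the second new record ("noor") is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["mohammed", 1234, 45], ["Ali", 1235, 35], ["Sara", 1236, 25]],
    columns=["name", "id", "age"],
)

# Extra records to add; "noor" is a hypothetical example
new_rows = pd.DataFrame(
    [["zaid", 1237, 33], ["noor", 1238, 29]],
    columns=df.columns,
)

# ignore_index=True renumbers the combined rows 0..n-1
df_all = pd.concat([df, new_rows], ignore_index=True)
```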


# API Based Data Collection
- Search for the term "API" for online services
 - Most social media services will have one
- Each API will have a unique way to access it


# API Protocols
- Restful
 - Most popular
 - HTTP based
- RPC/SOAP
- GraphQL

# Restful APIs
- You can access the API through the browser or an HTTP library
 - Requests is the most popular for python
- Dedicated libraries for specific services are available to make access easier
 - The libraries are called **clients**
 - Search pip, google, and github for available options

# How Restful APIs Work
- Connect and authenticate using an HTTP client, like requests
- Execute API action by calling an HTTP url and setting the appropriate HTTP action
 - Main actions are: GET, POST, PUT, DELETE
 - You will mostly use GET
- Response is likely JSON or XML which you have to parse into python objects
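
The last step, parsing a response into Python objects, can be sketched without a network connection. The response text below is a trimmed sample (most fields removed), and the standard-library `json` module is used here only so the snippet runs on its own.

```python
import json

# Trimmed sample of a JSON response body (most fields removed for brevity)
response_text = '[{"id": 1, "name": "grit", "full_name": "mojombo/grit"}]'

# JSON arrays become Python lists, JSON objects become dictionaries
parsed = json.loads(response_text)
first_name = parsed[0]["name"]
```

Once parsed, the data is plain lists and dictionaries, so normal indexing and loops apply.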

# The GET Restful Action
- This is the standard browser action when you type a URL
 - You can view Restful responses in the browser if the API does not require authentication
- You choose what you want to fetch through the url, e.g.:
 - Fetch a list of followers using the following url path:
 `followers/list`
 - See api [here](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-list)
- Some [require authentication](https://api.twitter.com/1.1/followers/list.json?cursor=-1&screen_name=twitterdev&skip_status=true&include_user_entities=false)
- Others [do not](https://api.github.com/repositories) and can be viewed in browser
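
A GET url is made of a path plus optional query parameters after the `?`. The sketch below rebuilds the followers url linked above from its parts using the standard library, so nothing is actually sent.

```python
from urllib.parse import urlencode

base_url = "https://api.twitter.com/1.1/followers/list.json"

# Query parameters select what the endpoint should return
params = {
    "cursor": -1,
    "screen_name": "twitterdev",
    "skip_status": "true",
    "include_user_entities": "false",
}

# urlencode joins the parameters as key=value pairs separated by "&"
full_url = f"{base_url}?{urlencode(params)}"

# With requests, the same dictionary can be passed directly (not executed here):
# response = requests.get(base_url, params=params)
```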

# Restful APIs Cont.
- Authentication is likely required (Difficult part)
- Be aware of transfer/access limitations
- **You have to read the API documentation!**
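
Authentication often comes down to sending an access token in an HTTP header. The sketch below assumes a "Bearer" token scheme, which is common but not universal; the helper name and the token are placeholders, and the documentation of each service states what it actually expects.

```python
def build_auth_headers(token):
    """Return HTTP headers carrying a bearer access token (assumed scheme)."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }


# Placeholder token, for illustration only; never hard-code a real one
headers = build_auth_headers("MY_ACCESS_TOKEN")

# With requests, the headers would be sent like this (not executed here):
# response = requests.get("https://api.github.com/repositories", headers=headers)
# response.raise_for_status()  # raises an exception on 4xx/5xx status codes
```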

In [24]:
# Restful Example

import requests

# This is an unauthenticated example
# For services that require authentication, see their documentation
# It usually involves generating an access token

# fetch github /repositories/ using a GET request
data = requests.get("https://api.github.com/repositories")

In [26]:
# Response contains text
data.text

# You can parse this data

'[{"id":1,"name":"grit","full_name":"mojombo/grit","owner":{"login":"mojombo","id":1,"avatar_url":"https://avatars0.githubusercontent.com/u/1?v=4","gravatar_id":"","url":"https://api.github.com/users/mojombo","html_url":"https://github.com/mojombo","followers_url":"https://api.github.com/users/mojombo/followers","following_url":"https://api.github.com/users/mojombo/following{/other_user}","gists_url":"https://api.github.com/users/mojombo/gists{/gist_id}","starred_url":"https://api.github.com/users/mojombo/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/mojombo/subscriptions","organizations_url":"https://api.github.com/users/mojombo/orgs","repos_url":"https://api.github.com/users/mojombo/repos","events_url":"https://api.github.com/users/mojombo/events{/privacy}","received_events_url":"https://api.github.com/users/mojombo/received_events","type":"User","site_admin":false},"private":false,"html_url":"https://github.com/mojombo/grit","description":"**Grit is no lo

In [27]:
# HTTP status code
data.status_code

200

In [32]:
# This is parsed data
# That is converted into python data structures
d = data.json()

d

[{'archive_url': 'https://api.github.com/repos/mojombo/grit/{archive_format}{/ref}',
  'assignees_url': 'https://api.github.com/repos/mojombo/grit/assignees{/user}',
  'blobs_url': 'https://api.github.com/repos/mojombo/grit/git/blobs{/sha}',
  'branches_url': 'https://api.github.com/repos/mojombo/grit/branches{/branch}',
  'collaborators_url': 'https://api.github.com/repos/mojombo/grit/collaborators{/collaborator}',
  'comments_url': 'https://api.github.com/repos/mojombo/grit/comments{/number}',
  'commits_url': 'https://api.github.com/repos/mojombo/grit/commits{/sha}',
  'compare_url': 'https://api.github.com/repos/mojombo/grit/compare/{base}...{head}',
  'contents_url': 'https://api.github.com/repos/mojombo/grit/contents/{+path}',
  'contributors_url': 'https://api.github.com/repos/mojombo/grit/contributors',
  'deployments_url': 'https://api.github.com/repos/mojombo/grit/deployments',
  'description': '**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object

In [33]:
d[0]

{'archive_url': 'https://api.github.com/repos/mojombo/grit/{archive_format}{/ref}',
 'assignees_url': 'https://api.github.com/repos/mojombo/grit/assignees{/user}',
 'blobs_url': 'https://api.github.com/repos/mojombo/grit/git/blobs{/sha}',
 'branches_url': 'https://api.github.com/repos/mojombo/grit/branches{/branch}',
 'collaborators_url': 'https://api.github.com/repos/mojombo/grit/collaborators{/collaborator}',
 'comments_url': 'https://api.github.com/repos/mojombo/grit/comments{/number}',
 'commits_url': 'https://api.github.com/repos/mojombo/grit/commits{/sha}',
 'compare_url': 'https://api.github.com/repos/mojombo/grit/compare/{base}...{head}',
 'contents_url': 'https://api.github.com/repos/mojombo/grit/contents/{+path}',
 'contributors_url': 'https://api.github.com/repos/mojombo/grit/contributors',
 'deployments_url': 'https://api.github.com/repos/mojombo/grit/deployments',
 'description': '**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented re

In [34]:
d[0]["description"]

'**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented read/write access to Git repositories via Ruby.'

In [37]:
# To construct a list of names, descriptions, and url

# remember we need a list of records
df_data = []

# you can loop over lists
for x in d:
    data_item = {
        "name": x["name"],
        "description": x["description"],
        "url": x["url"],
    }
    df_data.append(data_item)

pd.DataFrame(df_data)

Unnamed: 0,description,name,url
0,**Grit is no longer maintained. Check out libg...,grit,https://api.github.com/repos/mojombo/grit
1,Merb Core: All you need. None you don't.,merb-core,https://api.github.com/repos/wycats/merb-core
2,The Rubinius Language Platform,rubinius,https://api.github.com/repos/rubinius/rubinius
3,Ruby process monitor,god,https://api.github.com/repos/mojombo/god
4,Awesome JSON,jsawesome,https://api.github.com/repos/vanpelt/jsawesome
5,A JavaScript BDD Testing Library,jspec,https://api.github.com/repos/wycats/jspec
6,Unmaintained. Sorry.,exception_logger,https://api.github.com/repos/defunkt/exception...
7,include Enumerable — Unmaintained,ambition,https://api.github.com/repos/defunkt/ambition
8,Generates common user authentication code for ...,restful-authentication,https://api.github.com/repos/technoweenie/rest...
9,Treat an ActiveRecord model as a file attachme...,attachment_fu,https://api.github.com/repos/technoweenie/atta...


In [38]:
# Using list comprehensions
# You can do what we did last cell with a single line
df_data = [
    {
        "name": x["name"],
        "description": x["description"],
        "url": x["url"],
    } for x in d]


pd.DataFrame(df_data)

Unnamed: 0,description,name,url
0,**Grit is no longer maintained. Check out libg...,grit,https://api.github.com/repos/mojombo/grit
1,Merb Core: All you need. None you don't.,merb-core,https://api.github.com/repos/wycats/merb-core
2,The Rubinius Language Platform,rubinius,https://api.github.com/repos/rubinius/rubinius
3,Ruby process monitor,god,https://api.github.com/repos/mojombo/god
4,Awesome JSON,jsawesome,https://api.github.com/repos/vanpelt/jsawesome
5,A JavaScript BDD Testing Library,jspec,https://api.github.com/repos/wycats/jspec
6,Unmaintained. Sorry.,exception_logger,https://api.github.com/repos/defunkt/exception...
7,include Enumerable — Unmaintained,ambition,https://api.github.com/repos/defunkt/ambition
8,Generates common user authentication code for ...,restful-authentication,https://api.github.com/repos/technoweenie/rest...
9,Treat an ActiveRecord model as a file attachme...,attachment_fu,https://api.github.com/repos/technoweenie/atta...
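
Some fields in the response are nested, for example the `owner` dictionary inside each repository. One possible way to flatten them is `pd.json_normalize`; the sample below is a trimmed copy of one item with most fields removed.

```python
import pandas as pd

# Trimmed sample of one item from the response; most fields removed
sample = [
    {
        "name": "grit",
        "url": "https://api.github.com/repos/mojombo/grit",
        "owner": {"login": "mojombo", "id": 1},
    }
]

# Nested dictionaries are expanded into dotted column names
flat = pd.json_normalize(sample)
owner_login = flat.loc[0, "owner.login"]
```

The nested keys show up as columns named `owner.login` and `owner.id`, next to the top-level `name` and `url`.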


# Python Social Media clients
- [Tweepy](http://docs.tweepy.org)
- [Facebook SDK](https://facebook-sdk.readthedocs.io/en/latest/)
- [Python Linkedin](http://ozgur.github.io/python-linkedin/)
- [Instagram](https://github.com/facebookarchive/python-instagram)
- [Telegram](https://github.com/LonamiWebs/Telethon)
- [Youtube](https://github.com/youtube/api-samples/tree/master/python)
- [Tumblr](https://github.com/tumblr/pytumblr)
- [Reddit](https://praw.readthedocs.io/en/latest/)
- [Quora](https://github.com/csu/pyquora)
- [Github](https://developer.github.com/v3/)

# Twitter Access

## Using tweepy

install using:

```pip install tweepy```

You will also need to set up credentials for your application at [apps.twitter.com](https://apps.twitter.com)

In [40]:
import tweepy

# The following values are needed for authentication
# They can be obtained from apps.twitter.com
# I have hidden them for security reasons
consumer_key = "**HIDDEN**"
consumer_secret = "**HIDDEN**"
access_token = "**HIDDEN**"
access_token_secret = "**HIDDEN**"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# create the authenticated api object
api = tweepy.API(auth)

In [41]:
# Now you can fetch the data that you need
# let's fetch the followers for @zainkuwait
# (tweepy 4 renamed this method to api.get_followers)
zain_followers = api.followers("zainkuwait")

`zain_followers` will contain a list of User objects

To know what attributes the users will have you can check the [user](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object) object reference from twitter

In [46]:
# Let's construct a list of records using list comprehension

follower_data_list = [
    {
        # Remember, User is an object not a dictionary
        "screen_name": x.screen_name,
        "location": x.location,
        "description": x.description,
        "protected": x.protected,
        "followers_count": x.followers_count,
        "favourites_count": x.favourites_count,
        "statuses_count": x.statuses_count,
        "created_at": x.created_at,
    } for x in zain_followers
]

pd.DataFrame(follower_data_list)

Unnamed: 0,created_at,description,favourites_count,followers_count,location,protected,screen_name,statuses_count
0,2017-11-18 15:55:35,,0,0,,False,tninhszizuhqd41,0
1,2017-11-12 20:13:57,,0,0,,False,jvfTv4KIBgzZUla,0
2,2017-11-18 15:38:16,,0,0,,False,mRgZgj6rETxAP0u,0
3,2017-11-18 15:22:06,,1,4,,False,8Dlb6cTzeNFtulj,0
4,2012-06-14 05:37:50,,1996,3,kuwait,True,Q8_lanko,0
5,2017-11-16 18:33:53,,0,6,"الاحمدى, دولة الكويت",False,saidfaw79658353,0
6,2017-08-31 14:20:41,,0,2,,False,Oussama35543613,0
7,2017-11-18 14:54:20,,0,0,,False,sherlysimpson11,0
8,2017-11-13 18:53:38,,48,27,دولة الكويت,False,abdulla13469511,3
9,2016-10-18 11:05:30,,82,50,Kuwait,True,AmrKhal42267963,52


In [70]:
# let's place the data in a variable
x = pd.DataFrame(follower_data_list)

# let's find out how many followers zainkuwait has
zain_kw = api.get_user(screen_name="zainkuwait")

zain_kw.followers_count

519510

# Rate limiting
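With such a large number of followers, Twitter will prevent us from fetching all the users in one go, because it limits how many requests an application can make in a given time window. This is known as **Rate Limiting**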
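
A rough back-of-the-envelope estimate shows why this matters. The follower count comes from the cell above; the limit of 15 requests per 15-minute window is an assumption for illustration, so check the API documentation for the real figure.

```python
import math

followers_count = 519510   # value returned by the API above
page_size = 200            # largest page size the endpoint allows
requests_per_window = 15   # ASSUMPTION: check the API docs for the real limit
window_minutes = 15        # ASSUMPTION: length of one rate-limit window

# Number of requests needed to page through every follower
pages_needed = math.ceil(followers_count / page_size)

# Number of rate-limit windows needed to make that many requests
windows_needed = math.ceil(pages_needed / requests_per_window)

hours_needed = windows_needed * window_minutes / 60
```

Under these assumptions, that is 2598 pages spread over 174 windows, or about 43.5 hours of collection.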

The way around it is:
- Using cursors to fetch multiple pages
- Fetching data over a longer period of time
- Using multiple computers and apps to fetch different pages
- Combining the collected data
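
The first two points can be sketched with a generic loop that follows a cursor from page to page and pauses between requests. Everything here is hypothetical: `fetch_page` stands in for a real API call, and the fake data source exists only so the sketch runs without a network connection.

```python
import time


def fetch_all_pages(fetch_page, max_pages, pause_seconds=0.0):
    """Collect items page by page, following the cursor returned by each call.

    fetch_page(cursor) is a hypothetical callable that returns
    (items, next_cursor); next_cursor is None when no pages remain.
    """
    items, cursor = [], None
    for _ in range(max_pages):
        page_items, cursor = fetch_page(cursor)
        items.extend(page_items)
        if cursor is None:
            break
        time.sleep(pause_seconds)  # spread the requests over time
    return items


# Fake data source standing in for a real API: 3 pages of 2 items each
FAKE_PAGES = {None: ([1, 2], "p2"), "p2": ([3, 4], "p3"), "p3": ([5, 6], None)}


def fake_fetch_page(cursor):
    return FAKE_PAGES[cursor]


all_items = fetch_all_pages(fake_fetch_page, max_pages=10)
first_two_pages = fetch_all_pages(fake_fetch_page, max_pages=2)
```

Client libraries often handle this for you. With tweepy, for instance, passing `wait_on_rate_limit=True` to `tweepy.API` makes the client wait when the limit is reached.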

In [None]:
# lets fetch 3 pages only
# each page will contain 20 followers
# we can change that to a maximum of 200

follower_df = None
for followers_page in tweepy.Cursor(api.followers, screen_name="zainkuwait", count=200).pages(3):
    # construct a dataframe with every page
    follower_data_list = [
        {
            # Remember, User is an object not a dictionary
            "screen_name": x.screen_name,
            "location": x.location,
            "description": x.description,
            "protected": x.protected,
            "followers_count": x.followers_count,
            "favourites_count": x.favourites_count,
            "statuses_count": x.statuses_count,
            "created_at": x.created_at,
        } for x in followers_page
    ]

    page_df = pd.DataFrame(follower_data_list)
    if follower_df is None:
        follower_df = page_df
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        follower_df = pd.concat([follower_df, page_df], ignore_index=True)

In [87]:
len(follower_df)

600
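
When data is collected in several runs, or on several machines, the pieces can overlap. A minimal sketch of combining them, using made-up results and assuming `screen_name` identifies a user uniquely:

```python
import pandas as pd

# Made-up results from two separate collection runs that overlap by one user
run_a = pd.DataFrame({"screen_name": ["u1", "u2"], "followers_count": [5, 7]})
run_b = pd.DataFrame({"screen_name": ["u2", "u3"], "followers_count": [7, 1]})

# Stack the runs, then keep the first occurrence of each screen_name
combined = (
    pd.concat([run_a, run_b], ignore_index=True)
    .drop_duplicates(subset="screen_name")
    .reset_index(drop=True)
)
```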

In [100]:
follower_df.head(10)

Unnamed: 0,created_at,description,favourites_count,followers_count,location,protected,screen_name,statuses_count
0,2017-11-13 00:47:15,,0,0,بيتي,False,Ketman_8lm,4
1,2017-11-18 15:08:23,,0,0,,False,zN0eAegNnhq6MbM,0
2,2017-11-18 15:55:35,,0,0,,False,tninhszizuhqd41,0
3,2017-11-12 20:13:57,,0,0,,False,jvfTv4KIBgzZUla,0
4,2017-11-18 15:38:16,,0,0,,False,mRgZgj6rETxAP0u,0
5,2017-11-18 15:22:06,,1,4,,False,8Dlb6cTzeNFtulj,0
6,2012-06-14 05:37:50,,1996,3,kuwait,True,Q8_lanko,0
7,2017-11-16 18:33:53,,0,6,"الاحمدى, دولة الكويت",False,saidfaw79658353,0
8,2017-08-31 14:20:41,,0,3,,False,Oussama35543613,0
9,2017-11-18 14:54:20,,0,0,,False,sherlysimpson11,0


In [101]:
follower_df.tail(10)

Unnamed: 0,created_at,description,favourites_count,followers_count,location,protected,screen_name,statuses_count
590,2017-11-12 00:20:19,أشهد أن لا إله إلا الله وأشهد أن سيدنا محمد عب...,1,12,"السالمية, دولة الكويت",False,Ibrahim74588329,0
591,2017-11-11 17:45:51,الأخبار العامة أخبار العالم العلوم و التكنولوجيا,101,10,,False,laila_aleidan,16
592,2011-11-22 22:37:05,,0,13,,False,ibad80,1
593,2017-11-11 23:25:33,,0,1,,False,Y69fP456EU8ND9V,0
594,2017-11-11 22:34:57,,0,12,Kuwait,False,moalsuwailem1,0
595,2017-10-08 21:51:35,,0,2,,False,Tbmf3,0
596,2017-11-11 22:16:57,My Gob,1,2,,False,MuzafforIslam1,4
597,2017-11-11 22:10:36,,0,0,,False,gC9DV7GIahGn2QM,0
598,2013-10-08 23:36:08,#HackedByJM511 @T4TBHH,2111,274,,False,desinger_j84,380
599,2017-10-30 22:03:52,"Interior Designer , Artist , proud MOM 🕊",0,34,"Mishrif, Kuwait",False,AbrarMuhsen,2


In [102]:
follower_df.sample(10)

Unnamed: 0,created_at,description,favourites_count,followers_count,location,protected,screen_name,statuses_count
131,2017-10-12 20:27:56,‏وقل اعملوا فسيرى الله عملكم,168,31,,False,1BkicRwkk8s2go4,42
283,2017-11-15 07:15:39,,9,10,,False,stapleford1984,0
264,2017-11-14 09:54:23,,21,11,,False,x50jeAwF0urESzl,38
235,2017-06-10 03:12:21,,166,14,,False,mazen_zazo66,123
299,2017-11-14 22:45:36,,2,7,,False,noor19816,28
525,2017-11-12 18:18:52,,1,25,,False,ygjjgkihnjjb,3
487,2017-10-13 02:56:52,"I sneak drinks into movie theatres, Meditation...",0,7,"Windham, NH",False,JenjenSheeza,2
191,2017-11-16 11:09:31,اللهم لا تجعل ذكر أمي ينقطع وسخر لها الدعوات ط...,0,41,Kuwait,False,dalalii_32,6
560,2013-05-29 20:58:11,ألحوآر مع آلجھلاء ' كآلرسم على ميآھ آلبحر ! مھ...,0,280,Kuwait,False,hoda7647675,850
459,2013-06-08 20:19:16,لن نخضع والله خيرا حافظا,0,164,q8,False,tarekelfahhed,966


# Web Scraping

- You need to know how HTML is constructed

```HTML
<opening-tag attr="value" attr2="value2"> text </closing-tag>
```

# Typical HTML tags

```HTML
<H1>Headline 1</H1>
```
```HTML
<a href="https://google.com">Link to Google</a>
```
```HTML
<div class="content">main content text</div>
```



# HTML Can Be Nested

```HTML
<H1>
    Headline 1
    <div class="content">
        main content text
        <a href="https://google.com">Link inside main content</a>
    </div>
</H1>

```
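
To see how a parser views nested tags, the sketch below uses Python's built-in `html.parser` to record each opening tag with its nesting depth and attributes. The class name `OutlineCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class OutlineCollector(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes)

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


html_text = '<div class="content">text <a href="https://google.com">link</a></div>'
parser = OutlineCollector()
parser.feed(html_text)
outline = parser.outline
```

The `a` tag is recorded one level deeper than the `div` that contains it, which is the structure scraping tools let you navigate.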



# How Scraping Works
1. Using the browser, you examine the source of the content to find the information you need and identify the tags and attributes associated with the information you want
2. You load the HTML page as text using **requests**
3. You parse the HTML text using the scraping tool
4. Using the scraping tool, you fetch the items that meet the criteria you identified in step 1
5. Repeat and test using jupyter notebook to ensure you captured the data correctly
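
Steps 3 and 4 can be sketched on a small hand-written page, so no download is needed. The sketch assumes the `beautifulsoup4` package is installed; the HTML, class names, and paths are made up for illustration.

```python
from bs4 import BeautifulSoup

# Step 2 is skipped: a hand-written page stands in for the downloaded HTML
html_text = """
<html><body>
  <h2 class="page-title"><a href="/news/1/">First headline</a></h2>
  <h2 class="page-title"><a href="/news/2/">Second headline</a></h2>
  <h2 class="other"><a href="/ads/">Not a headline</a></h2>
</body></html>
"""

# Step 3: parse the HTML text
soup = BeautifulSoup(html_text, "html.parser")

# Step 4: fetch only the tags that match the criteria found in step 1
records = [
    {"title": h2.a.get_text(strip=True), "path": h2.a["href"]}
    for h2 in soup.find_all("h2", class_="page-title")
]
```

The third `h2` is left out because its class does not match the criteria.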

# Beautiful Soup

Used to easily pick information from an HTML or XML document

Install using:

``` pip install beautifulsoup4 ```

In [89]:
import requests
from bs4 import BeautifulSoup as BS

# Load html page using requests
res = requests.get("http://www.alanba.com.kw/newspaper/")

# Parse the text of the webpage using BS
bs = BS(res.text, 'html.parser')

# Now examine the page using the developer tools to find the tags and attributes you need

# From the Browser Developer Tools
## We discovered that we need the text matching the criteria:
- Under the h2 tag of class ```page-title CenterTextAlign```
- The path will be the href of the a tag inside the mentioned h2 tag
- The title text will be the text of the same a tag

In [90]:
# fetch all h2 tags of class page-title CenterTextAlign
h2_tags = bs.find_all("h2", class_="page-title CenterTextAlign")

In [91]:
# The data looks correct
h2_tags

[<h2 class="page-title CenterTextAlign"><a href="/ar/kuwait-news/791415/18-11-2017-هيئة-مواجهة-الأزمات-بمباركة-حكومية/">هيئة مواجهة الأزمات.. بمباركة حكومية</a></h2>,
 <h2 class="page-title CenterTextAlign"><a href="/ar/kuwait-news/791401/18-11-2017-المرزوق-أمطار-مثيل-لها-الثلاثاء-المقبل-وتستمر-خفيفة-ومتفرقة-حتى-ديسمبر/">المرزوق: أمطار لا مثيل لها الثلاثاء المقبل وتستمر خفيفة ومتفرقة حتى 2 ديسمبر</a></h2>,
 <h2 class="page-title CenterTextAlign"><a href="/ar/kuwait-news/791371/18-11-2017-بالفيديو-الطيران-الشراعي-زين-سماء-الكويت-بـ-سمعا-وطاعة-يا-صاحب-السمو/">بالفيديو.. الطيران الشراعي زيَّن سماء<br/> الكويت بـ «سمعاً وطاعة يا صاحب السمو»</a></h2>,
 <h2 class="page-title CenterTextAlign"><a href="/ar/kuwait-community/occasions-events/791275/18-11-2017-سلطنة-عمان-تحتفل-اليوم-بعيدها-الوطني-نهضة-تنموية-مستمرة-بقيادة-السلطان-قابوس/">سلطنة عُمان تحتفل اليوم بعيدها الوطني الـ 47: نهضة تنموية مستمرة بقيادة السلطان قابوس</a></h2>,
 <h2 class="page-title CenterTextAlign"><a href="/ar/kuwait-news/

In [92]:
# Let's experiment with how to get the text
h2_tags[0].text

'هيئة مواجهة الأزمات.. بمباركة حكومية'

In [96]:
# now the url path
h2_tags[0].a.attrs["href"]

'/ar/kuwait-news/791415/18-11-2017-هيئة-مواجهة-الأزمات-بمباركة-حكومية/'

In [97]:
# Now use list comprehension to construct your list of records
headline_list = [
    {
        "title": x.text,
        "path": x.a.attrs["href"],
    } for x in h2_tags
]
pd.DataFrame(headline_list)

Unnamed: 0,path,title
0,/ar/kuwait-news/791415/18-11-2017-هيئة-مواجهة-...,هيئة مواجهة الأزمات.. بمباركة حكومية
1,/ar/kuwait-news/791401/18-11-2017-المرزوق-أمطا...,المرزوق: أمطار لا مثيل لها الثلاثاء المقبل وتس...
2,/ar/kuwait-news/791371/18-11-2017-بالفيديو-الط...,بالفيديو.. الطيران الشراعي زيَّن سماء الكويت ب...
3,/ar/kuwait-community/occasions-events/791275/1...,سلطنة عُمان تحتفل اليوم بعيدها الوطني الـ 47: ...
4,/ar/kuwait-news/791387/18-11-2017-بالفيديو-حرك...,بالفيديو.. حركة «BDS الكويت»: المقاطعة أداة مه...
5,/ar/kuwait-news/791408/18-11-2017-بالفيديو-معر...,بالفيديو.. معرض الكتاب يزيل الحواجز بين الطفل ...
6,/ar/arabic-international-news/lebanon-news/791...,الحريري بعد لقاء ماكرون: سأحتفل بعيد الاستقلال...
7,/ar/kuwait-news/791437/19-11-2017-وزير-الإعلام...,المذيع وليد المؤمن في ذمة الله
8,/ar/kuwait-news/791461/19-11-2017-الحربي-السما...,جمال الحربي: لن نفرج عن اي منتج زراعي مصري الا...
9,/ar/last/791450/19-11-2017-بالفيديو-شاهدت-أحدا...,بالفيديو.. هل شاهدت أحداً محظوظاً كهذا الصبي؟ ...
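
The collected paths are relative to the site, so they cannot be opened on their own. One way to turn a path into a full url is `urljoin` from the standard library, sketched here with the page url and the first path from above.

```python
from urllib.parse import urljoin

page_url = "http://www.alanba.com.kw/newspaper/"
path = "/ar/kuwait-news/791415/18-11-2017-هيئة-مواجهة-الأزمات-بمباركة-حكومية/"

# A path starting with "/" replaces the whole path of the page url
full_url = urljoin(page_url, path)
```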


# Saving Dataframes

You can save your data frames to a file using the `to_*` methods

In [103]:
# After creating the list of headlines
mydf = pd.DataFrame(headline_list)

# store as csv
mydf.to_csv("headlines.csv")

# you can store as json
mydf.to_json("headlines.json")

# or even excel (writing .xlsx files requires the openpyxl package)
mydf.to_excel("headlines.xlsx")
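
A saved file can be loaded again with the matching `read_*` function. The sketch below does a round trip through CSV in a temporary folder, using the small name/id/age table from earlier; the file name is made up.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    [["mohammed", 1234, 45], ["Ali", 1235, 35], ["Sara", 1236, 25]],
    columns=["name", "id", "age"],
)

# Write into a temporary folder so no files are left behind
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "people.csv")
    # index=False leaves out the row numbers, so no "Unnamed: 0"
    # column appears when the file is read back
    df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)
```

Without `index=False`, the row numbers are written as an extra first column, which `read_csv` then loads under the name `Unnamed: 0`.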

# Which Approach to Use?
1. Use a client if one exists
2. Use the Restful API directly if no client exists
3. Scrape if you have to
 - Scraping assumes the page structure stays stable