In [1]:
%matplotlib inline

### Libraries

In [2]:
import pandas as pd
import numpy as np

# Data Management

## Content
1. Getting the data
2. Data Nomenclature
3. Getting data from files: Text (CSV) & MS Excel files
4. Getting data from Databases
5. Getting data from WWW: API access (Twitter)
6. Getting data from WWW: Web Scraping
7. Data Cleanup
8. Data Manipulation: Filtering, Transforming, Aggregating and Sorting

## 1. Getting the Data
We are entering the **age of data**.

The challenge is no longer where to get data from, but **what to do with it**.

The kind of information in each data set will **drive the type of research you perform**.

**Data issues**:
* Quality & Reliability
* Integrity (accuracy & consistency)
* Missing values
* Record linkage (matching)
* Privacy
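
Two of these issues, missing values and duplicated records, are easy to measure as soon as the data is loaded. The following is a minimal sketch on a hypothetical four-row table (the names and ages are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical records: one missing age and one repeated row.
records = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Cid"],
    "age": [31, np.nan, 45, 45],
})

# Count missing values per column.
missing_per_column = records.isna().sum().to_dict()
print(missing_per_column)  # {'name': 0, 'age': 1}

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(records.duplicated().sum())
print(duplicate_rows)  # 1
```

Counting first makes it easier to decide later whether to drop, fill, or investigate the affected rows.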

## 2. Data Nomenclature

### Semantics vs Type

**Data Semantics**: 
* The real-world meaning. 
* e.g., company name, day of the month, person height, etc. 

**Data Type**: 
* Interpretation in terms of scales of measurement
* e.g., quantity or category, sensible mathematical operations, data structure, etc.

### Types

**Nominal (labels)**: 
* Operations: =, ≠
* e.g. Apples, oranges, bananas…

**Ordinal (ordered)**: 
* Operations: =, ≠, >, <
* e.g. Small, medium, large

**Interval (location of zero arbitrary)**: 
* Operations: =, ≠, >, <, +, − (distance)
* e.g. Dates: Jan 19; Location: (Lat, Long)
* Values behave like geometric points: they can be ordered, but their ratios are meaningless.
* Only differences (i.e., intervals) can be compared as magnitudes.

**Ratio (zero fixed)**: 
* Operations: =, ≠, >, <, +, −, ×, ÷ (proportions)
* e.g. Measurements: Length, Mass, Temperature (in kelvin), ...
* Origin is meaningful, can measure ratios & proportions

The last two types (interval and ratio) are usually grouped together as **Quantitative**.
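
The four scales map fairly naturally onto pandas types. The sketch below is only an illustration with made-up values; it shows which operations each scale supports:

```python
import pandas as pd

# Nominal: unordered categories support only equality tests.
fruit = pd.Series(["apple", "orange", "banana", "apple"], dtype="category")
print((fruit == "apple").tolist())  # [True, False, False, True]

# Ordinal: an ordered categorical also supports <, > and min/max.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large"], dtype=size_type)
print((sizes > "small").tolist())  # [True, False, True]
print(sizes.max())  # large

# Interval: two dates can be subtracted to get a distance between them.
gap = pd.Timestamp("2015-01-19") - pd.Timestamp("2015-01-01")
print(gap.days)  # 18

# Ratio: a true zero makes proportions meaningful.
lengths_m = pd.Series([2.0, 4.0])
print(lengths_m[1] / lengths_m[0])  # 2.0
```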

### Some Concepts

* **Dataset:** a collection of data.
* **Training set:** the data used to discover potentially predictive relationships.
* **Test set:** the data used to assess the strength and utility of a predictive relationship.
* **Example, instance, or item:** a single fact or data point, usually one row of the dataset.
* **Attribute, feature, or variable:** an individual measurable property of the phenomenon being observed.
* **Target attribute or variable:** the attribute whose values are to be predicted. The target is never used as an input to predict itself.
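
These concepts can be made concrete with a small sketch. The toy dataset and the 70/30 split below are assumptions chosen for illustration, not a recommendation:

```python
import pandas as pd

# Hypothetical toy dataset: each row is an instance, each column an attribute.
dataset = pd.DataFrame({
    "age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13, 30.07, 16.70],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # target attribute
})

# Hold out 30% of the rows as the test set; the rest is the training set.
test_set = dataset.sample(frac=0.3, random_state=0)
training_set = dataset.drop(test_set.index)

# Separate the features from the target so the target never predicts itself.
X_train = training_set.drop(columns="survived")
y_train = training_set["survived"]

print(len(training_set), len(test_set))  # 7 3
```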

## 3. Getting data from files: Text (CSV) & MS Excel files

### 3.1 Getting data from CSV

One of the simplest and most common ways of sharing data today is CSV (Comma-Separated Values).

CSV has become a standard file format for exchanging data between many different applications.

CSV files usually have a .csv extension.
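
To see what the format looks like, here is a minimal sketch that parses a small, hypothetical CSV text held in memory. A header row names the columns, and quoted fields may contain the delimiter itself:

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line is the header row.
csv_text = """PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
2,"Heikkinen, Miss. Laina",26
"""

# Quoted fields keep their embedded commas (note the names).
passengers = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(passengers.loc[1, "Name"])  # Braund, Mr. Owen Harris
print(passengers.shape)  # (2, 2)
```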

In [3]:
titanicTrainingSet = pd.read_csv('../data/Titanic/train.csv', index_col=0)

In [4]:
titanicTrainingSet.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [5]:
titanicTrainingSet.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

> Convert to category data type

In [6]:
titanicTrainingSet['Survived'] = titanicTrainingSet['Survived'].astype('category')
titanicTrainingSet['Pclass'] = titanicTrainingSet['Pclass'].astype('category')
titanicTrainingSet['Sex'] = titanicTrainingSet['Sex'].astype('category')
titanicTrainingSet['Embarked'] = titanicTrainingSet['Embarked'].astype('category')

In [7]:
titanicTrainingSet.dtypes

Survived    category
Pclass      category
Name          object
Sex         category
Age          float64
SibSp          int64
Parch          int64
Ticket        object
Fare         float64
Cabin         object
Embarked    category
dtype: object
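
Several columns can also be converted in a single call by passing `astype` a column-to-dtype mapping. The sketch below uses a hypothetical three-row miniature of the Titanic columns rather than the real file:

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns, just for illustration.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})

# Convert several columns at once; columns not in the mapping are untouched.
categorical_columns = ["Survived", "Pclass", "Sex", "Embarked"]
sample = sample.astype({column: "category" for column in categorical_columns})

print(sample.dtypes["Sex"])  # category
print(list(sample["Embarked"].cat.categories))  # ['C', 'S']
```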

### 3.2 Getting data from MS Excel

MS Excel is a widespread tool in many organisations.

It is therefore very common to exchange data and information as MS Excel files, for two reasons:
* Most enterprise tools can export data to MS Excel.
* Non-technical people often use Excel as a database replacement.
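
Reading a worksheet works much like reading a CSV file. The helper below is a minimal sketch; `load_excel_sheet`, the file path, and the sheet name are hypothetical, and reading `.xlsx` files assumes an Excel engine such as `openpyxl` is installed:

```python
import pandas as pd


def load_excel_sheet(path, sheet_name=0):
    """Read one worksheet into a DataFrame, using the first row as the header."""
    # sheet_name accepts a zero-based position or the worksheet's name.
    return pd.read_excel(path, sheet_name=sheet_name, header=0)


# Hypothetical usage; the path and sheet name are placeholders:
# sales = load_excel_sheet("../data/sales.xlsx", sheet_name="2015")
```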

## 4. Getting Data from WWW

### 4.1 Getting data from Twitter API

The REST APIs provide programmatic access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more.

The REST API identifies Twitter applications and users using OAuth.

Responses are available in JSON.

More info: https://dev.twitter.com/rest/public
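
Because responses arrive as JSON, they can be decoded with the standard `json` module and flattened into a table. The payload below is a hypothetical, heavily trimmed example written for this sketch; real responses carry many more fields:

```python
import json
import pandas as pd

# Hypothetical payload, not an actual API response.
response_body = """
[
  {"id": 1, "text": "Hello data", "user": {"screen_name": "alice"}},
  {"id": 2, "text": "Big data everywhere", "user": {"screen_name": "bob"}}
]
"""

tweets = json.loads(response_body)  # JSON array -> list of dicts

# Flatten the nested records into a table, one row per tweet.
table = pd.DataFrame({
    "id": [t["id"] for t in tweets],
    "author": [t["user"]["screen_name"] for t in tweets],
    "text": [t["text"] for t in tweets],
})
print(table["author"].tolist())  # ['alice', 'bob']
```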

#### 4.1.1 Create a Twitter App
Follow the steps in the given order to create and authorize your app:

1. You must have a Twitter account. If you do not have one, create it at twitter.com.
2. Log on to https://dev.twitter.com with the same credentials.
3. Go to https://apps.twitter.com/
4. Click on "Create New App".
5. Give your app a name.
6. Give it a description.
7. Enter the URL of your own website; one is required to create the app.
8. A callback URL is not needed.
9. Read the agreement and click on the Agree button.
10. Click on "Create your Twitter Application".

#### 4.1.2 Get Permissions

1. Click on the Permissions tab.
2. Change the permission to "Read, Write and Access direct messages". Without this permission you can't access the APIs.
3. Changing the access level requires a phone number to be registered in your Twitter account.

#### 4.1.3 Generate access token

1. Go to API Keys tab
2. Press Regenerate API keys button
3. Press Create my access token button to generate keys

Tweepy, a Python client for the Twitter API: https://github.com/tweepy/tweepy

In [9]:
import tweepy

In [10]:
import os

# Read the credentials from environment variables rather than hardcoding them.
# The variable names below are assumptions; use whatever names you export.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [11]:
api = tweepy.API(auth)

# Fetch the most recent tweets from the authenticated user's home timeline
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
 

Microcerebros para combatir el autismo http://t.co/pGQCUAqcJy http://t.co/L4utHSmkrh
RT @tumbIerposts: when ur mom says "go before I change my mind" http://t.co/9wPYfaX10a
RT @BigDataBlogs: ZDNet » Apache Atlas, Parquet progress; Whirr retired: Data governance and columnar storage ... http://t.co/j0JFeWT8VG #B…
Choose the right kind of Facebook Groups for your Business. http://t.co/FjSagDmgKz http://t.co/j77xFz87rC
RT @ravikikan: #Startup tip of today #startups #quote #quotes #growthhacking #smallbiz #wisdom #mktg #innovation #socialmedia #tech http://…
RT @wef: These are the most innovative countries in the world http://t.co/m2BsrJTJwH #innovation http://t.co/V1HHtJU8w0
Why good #business #intelligence needs good parenting http://t.co/ZStGsOxprW http://t.co/Po0it9dzHy
RT @janinebucks: Big data: The next frontier for innovation, competition, and productivity  | http://t.co/TSpahfFHV8 | Economics #free #ebo…
RT @yprez: Building a Movie Recommendation Service with Apache Spark &amp; Fla

In [12]:
user = api.get_user(screen_name='jbaquerot')

In [13]:
user.entities

{u'description': {u'urls': []},
 u'url': {u'urls': [{u'display_url': u'about.me/jbaquerot',
    u'expanded_url': u'http://about.me/jbaquerot',
    u'indices': [0, 22],
    u'url': u'http://t.co/UmCed05SEE'}]}}

In [14]:
# Tweepy 4.x renamed API.search to API.search_tweets
[tweet.text for tweet in api.search(q="info_GMV")]

[u'\u3010\u767d\u732b\u3011\u4eca\u307e\u3067\u8abf\u5b50\u826f\u304f\u30ec\u30b9\u3057\u3066\u305d\u306e\u8cea\u554f\uff1f http://t.co/BTIJXIehvi',
 u'Guide prices for the new phase of apartments launching soon, start at \xa3350,000 for a one bedroom apartment, more info http://t.co/GUQzpiAFgz',
 u'Ada info kak,@ezak_gmv,MAU belajar Bahasa INGGRIS tanpa KURSUS?INFO KLIK https://t.co/OhhD7I6BD0 http://t.co/RuxIJQpmL8',
 u'Guide prices for the new phase of apartments launching soon, start at \xa3350,000 for a one bedroom apartment, more info http://t.co/GUQzpiAFgz',
 u'[\u65b0\u7740]: \u3010\u9ed2\u732b\u306e\u30a6\u30a3\u30ba\u3011\u52a9\u3063\u4eba\u304b\u3089\u7d42\u7109\u547c\u3079\u3070\u6bb4\u3089\u305a\u7d42\u308f\u308b\u3093\u3058\u3083\u306a\u3044\u3067\u3059\u304b\u306d\u305d\u308c http://t.co/14bmcM79hu',
 u'GMV DESARROLLAR\xc1 EL CENTRO DE CONTROL DEL SAT\xc9LITE PARA FINES BANCARIOS BRIsat http://t.co/rM2he9mK18']