# Data Exploration with IPython Jupyter

## Quick Introduction

This is an IPython [Jupyter](http://jupyter.org/) Notebook, like the one we saw in class. 
You can view and work with it in one of the following ways:

- (Linux/OSX) run `ipython notebook` from a terminal, then browse to this file in the web interface
- (OSX) if you have installed [Pineapple](https://nwhitehead.github.io/pineapple/), just open this file from Finder
- (any OS) install [Anaconda](https://www.continuum.io/downloads), then open the `Launcher`, which is linked on the Desktop (Linux/OSX) or can be found by searching for it (Windows)

Ref. [bit.ly/BTS-S3-DataExploration-Data](http://bit.ly/BTS-S3-DataExploration-Data)

## Import Libraries

Let's import some of the libraries that we are going to use in this notebook:

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display

### Configuring IPython

The first two lines of the next cell silence the *warnings* that Python would otherwise print. 

This keeps the notebook output clean, and it is done by importing the library `warnings` and then calling the function `filterwarnings`. 

The other three lines set display options of the library `pd`, which is just an alias for `pandas`: the maximum number of columns shown, the output width in characters, and the maximum width of a single column. 

(Note that none of these lines is mandatory.)

In [5]:
import warnings
warnings.filterwarnings("ignore") 
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 100)
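
If you prefer not to change these settings for the whole notebook, both can also be applied to a single block. The following is a minimal sketch, assuming a hypothetical toy dataframe called `demo` that is not part of the dataset:

```python
import warnings

import pandas as pd

# Hypothetical toy dataframe, used only to illustrate the settings.
demo = pd.DataFrame({"text": ["x" * 300], "n": [1]})

# Options set through option_context apply only inside the `with` block.
with pd.option_context("display.max_colwidth", 20):
    short_repr = repr(demo)

# Outside the block the previous setting is back in place.
long_repr = repr(demo)

# Warnings can be silenced for one block instead of globally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warnings.warn("this warning is suppressed")
```

Scoping the settings this way keeps warnings visible in the rest of the notebook, which can help when something goes wrong later.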

## Data Exploration

Let's start by importing the dataset into a **DataFrame**, a pandas object optimized for data analysis.

You can find other examples of analysis with Python Pandas at the following links:

- [easy introduction to DataFrame with Pandas](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/)
- [introduction to Pandas](http://synesthesiam.com/posts/an-introduction-to-pandas.html)
- [Pandas Homepage](http://pandas.pydata.org/)

In [6]:
rawdf = pd.read_csv('../../data/DF-Data_twitter_vine/twitter-vine-clean.tsv.bz2', sep='\t', compression='bz2')
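
To try the same call without the course dataset, here is a small sketch that writes a bz2-compressed TSV to a temporary file and reads it back. The sample content, the file name `sample.tsv.bz2` and the variable names are assumptions made for illustration only:

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample standing in for the real file.
sample = "tweet_id\tlang\ttext\n1\ten\thello\n2\tes\thola\n"

# Write a bz2-compressed, tab-separated file into a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.tsv.bz2")
with bz2.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

# Same call shape as above; nrows limits how many rows are parsed.
sample_df = pd.read_csv(path, sep="\t", compression="bz2", nrows=1)
print(sample_df.shape)
```

With a file of a couple of million rows, `nrows` is a convenient way to check the column layout before loading everything.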

### Quick Start with Data Exploration

Print some information about the dataset:

In [11]:
print('See number of rows and number of columns:')
print(rawdf.shape)

print('See the top 5 lines:')
rawdf.head(5)

See number of rows and number of columns:
(2129755, 22)
See the top 5 lines:


Unnamed: 0,created_at,tweet_id,text,source,favorited,retweeted,possibly_sensitive,lang,geo,coordinates,hashtags,urls,user_mentions,in_reply_to_user_id,retweet_count,user_id,user_name,user_location,user_followers_count,user_friends_count,user_created_at,user_verified
0,2013-01-28T13:41:54+00:00,295889441355071488,BBC News - Twitter launches Vine micro-video sharing service http://t.co/MykDuEgx,"<a href=""http://twitter.com/tweetbutton"" rel=""nofollow"">Tweet Button</a>",False,False,False,en,,,[],"[{'url': 'http://t.co/MykDuEgx', 'indices': [61, 81], 'display_url': 'bbc.in/148Grra', 'expanded...",[],,0,632569755,Sherry Edwards,US,8,45,2012-07-11T02:01:19+00:00,False
1,2013-01-28T13:41:57+00:00,295889452121866241,“Vine” อลวน โป๊ระบาด-เฟซบุ๊กสั่งแบน-ทวิตเตอร์เซ็ง http://t.co/mIo9PO6t,"<a href=""http://www.facebook.com/twitter"" rel=""nofollow"">Facebook</a>",False,False,False,th,,,[],"[{'url': 'http://t.co/mIo9PO6t', 'indices': [50, 70], 'display_url': 'manager.co.th/Cyberbiz/Vie...",[],,0,94285151,Keng KSPstudio,,165,736,2009-12-03T08:18:58+00:00,False
2,2013-01-28T13:41:58+00:00,295889454705565696,"Really excited for the explosion of 'What does Vine mean for _______?' pieces, which will dodge ...",web,False,False,,en,,,[],[],[],,0,21334099,Ryan Farkas,Toronto,367,841,2009-02-19T20:13:33+00:00,False
3,2013-01-28T13:41:58+00:00,295889458639802368,It's fun to think about the potential that exists between #vine and those who have DIY #Etsy bus...,web,False,False,,en,,,"[{'indices': [58, 63], 'text': 'vine'}, {'indices': [87, 92], 'text': 'Etsy'}]",[],"[{'id': 18288876, 'id_str': '18288876', 'screen_name': 'cary_weston', 'name': 'Cary Weston', 'in...",,0,317935518,Pat Lemieux,Bangor Maine,354,383,2011-06-15T18:14:44+00:00,False
4,2013-01-28T13:42:01+00:00,295889470333517824,Just Vined Lets You See the Last 20 Videos on Vine http://t.co/5tqQtM92,web,False,False,False,en,,,[],"[{'url': 'http://t.co/5tqQtM92', 'indices': [51, 71], 'display_url': 'bit.ly/Vrvb87', 'expanded_...",[],,0,736658053,adelineuddbbilo,Tampa,43,99,2012-08-04T12:33:45+00:00,False


In [9]:
print('Get a better overview about the numerical (and ONLY numerical) columns:')
rawdf.describe()

Get a better overview about the numerical (and ONLY numerical) columns:


Unnamed: 0,tweet_id,favorited,retweet_count,user_followers_count
count,2129755.0,2129755,2129755.0,2129755.0
mean,3.010565e+17,0,63338.62,3419.690028
std,3075295000000000.0,0,6094841.0,55999.003897
min,2.958894e+17,False,0.0,-1.0
25%,2.985332e+17,0,0.0,107.0
50%,3.010991e+17,0,0.0,264.0
75%,3.037247e+17,0,0.0,706.0
max,3.061918e+17,False,1184210000.0,8894046.0
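
By default `describe()` skips text columns when numerical ones are present. As a hedged illustration on a hypothetical toy dataframe (the values are invented), passing `include="all"` also summarises the non-numerical columns:

```python
import pandas as pd

# Hypothetical toy data that borrows a few column names from the dataset.
toy = pd.DataFrame({
    "lang": ["en", "es", "en"],
    "retweet_count": [0, 3, 1],
    "user_followers_count": [10, 200, 35],
})

# Default behaviour: only the numerical columns are summarised.
numeric_summary = toy.describe()

# include="all" adds the remaining columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```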


### Focus on an Individual Dimension

#### Column "LANG"

Discovering:

- *groupby*
- *apply*
- *sort*

In [10]:
# group the rows by the value in 'lang' and apply the function len to each group
pd.DataFrame(rawdf.groupby(['lang']).apply(len))

# without assigning the result to a variable we lose it, so let's save it into a new dataframe:
tmpdf = pd.DataFrame(rawdf.groupby(['lang']).apply(len))

# note that 'len' is a simple function that counts the number of rows, for example:
print('The number of rows in our dataframe is:', len(rawdf))

# let's sort the new dataframe by column value
# (DataFrame.sort was removed from pandas; sort_values is its replacement)
tmpdf.sort_values(by=0, ascending=False).head()

The number of rows in our dataframe is: 2129755


lang,0
en,418018
es,149314
und,33008
pt,20961
it,15414


We can see that the most common language in our dataset is 'en', with 418,018 tweets. 
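
The same kind of count can be obtained in other ways. This is a minimal sketch on a hypothetical toy column (the values are invented), comparing `groupby(...).size()` with `value_counts()`:

```python
import pandas as pd

# Hypothetical toy column standing in for rawdf['lang'].
toy = pd.DataFrame({"lang": ["en", "es", "en", "und", "en", "es"]})

# Count the rows of each group, then sort from the largest to the smallest.
by_group = toy.groupby("lang").size().sort_values(ascending=False)

# value_counts does the counting and the descending sort in one call.
by_counts = toy["lang"].value_counts()

print(by_group)
print(by_counts)
```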

Question: what is `und`?

You can try to work it out by printing the rows whose `lang` field is equal to `und`, 
or, even better, by printing only the text of those tweets, as below:

In [5]:
# filtering by 'lang' value:
display(rawdf[rawdf.lang=='und'].head(2)) 
# NOTE: if you want to show something that is not the last expression of the current cell,
# then you need to call the 'display' function

# The previous command (without the 'display') returns a dataframe,
# so you can select only the columns you want from that dataframe:
rawdf[rawdf.lang=='und'][['text', 'lang']].head(5)

Unnamed: 0,created_at,tweet_id,text,source,favorited,retweeted,possibly_sensitive,lang,geo,coordinates,...,user_mentions,in_reply_to_user_id,retweet_count,user_id,user_name,user_location,user_followers_count,user_friends_count,user_created_at,user_verified
8,2013-01-28T13:42:03+00:00,295889478831198208,http://t.co/G5vwi7u3 http://t.co/Sx2tn2wV,"<a href=""http://www.facebook.com/twitter"" rel=...",False,False,False,und,,,...,[],,0,249921061,MI.COM.CO,Colombia,1774,1958,2011-02-10T01:49:43+00:00,False
36,2013-01-28T13:42:31+00:00,295889593402814465,http://t.co/q8y5mJUk,"<a href=""http://vine.co"" rel=""nofollow"">Vine f...",False,False,False,und,,,...,[],,0,606384172,Alpaca Farm,,142,271,2012-06-12T14:59:12+00:00,False


Unnamed: 0,text,lang
8,http://t.co/G5vwi7u3 http://t.co/Sx2tn2wV,und
36,http://t.co/q8y5mJUk,und
72,http://t.co/nmnD8o6L,und
96,http://t.co/gdCVRkT6,und
99,http://t.co/CyUFYjAY,und
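
The filter used above is a boolean mask: a comparison on a column returns one `True`/`False` per row, and indexing with it keeps only the `True` rows. A small sketch on a hypothetical toy dataframe (invented values), using `.loc` to select rows and columns in one step:

```python
import pandas as pd

# Hypothetical toy data that borrows a few column names from the dataset.
toy = pd.DataFrame({
    "text": ["hello", "http://t.co/abc", "hola"],
    "lang": ["en", "und", "es"],
    "retweet_count": [0, 0, 2],
})

# The comparison yields one boolean per row.
mask = toy["lang"] == "und"

# .loc takes the row mask and the list of columns together.
und_rows = toy.loc[mask, ["text", "lang"]]

# Masks can be combined with & (and) and | (or); each condition needs parentheses.
en_or_es = toy[(toy["lang"] == "en") | (toy["lang"] == "es")]
```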


#### Column "SOURCE"

Let's see how to apply a custom function.

In [20]:
# this function receives a single value of the column (not a group) because we are not grouping
def myfunc(line):
    if 'web' in line:
        return 'yes'
    else:
        return 'no'

# let's see which lines contain the word "web",
# but only for the first 10 lines of the dataframe to make it quicker

# this is the way to select the first 10 lines:
rawdf['source'][0:10]

# and then we apply the function to each of them:
rawdf['source'][0:10].apply(myfunc)

0     no
1     no
2    yes
3    yes
4    yes
5    yes
6    yes
7     no
8     no
9     no
Name: source, dtype: object
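
For a simple test like this one, pandas string methods can do the same job without a custom function. The sketch below uses a hypothetical three-element series (not the real column) and compares `apply` with `str.contains` plus `numpy.where`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of 'source' values.
sources = pd.Series([
    "web",
    '<a href="http://vine.co" rel="nofollow">Vine</a>',
    "web",
])

def myfunc(line):
    return 'yes' if 'web' in line else 'no'

# Row-by-row version, as in the cell above.
applied = sources.apply(myfunc)

# Vectorized version: a plain substring test on the whole column at once.
contains_web = sources.str.contains("web", regex=False)
vectorized = pd.Series(np.where(contains_web, "yes", "no"))

print(applied.tolist())
print(vectorized.tolist())
```

On a large column the vectorized form is usually faster, while `apply` stays the more flexible option when the logic does not fit a built-in method.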

#### Something more complicated..

Let's find the lines that contain a link, such as:

```
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
```

and let's extract the text between '`>`' and '`<`' using some string methods.

In [22]:
# let's see how to process a line such as the following:
line = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# first we want to see if the line starts as a link:
if line.startswith('<a href'):
    # and in this case we want to split it by '>' first:
    substrings = line.split('>')
    # the method "split" returns the list of strings found on either side of each '>'
    print('this is the list:', substrings)
    # we know that what we want is the second element (index 1, since python starts counting from 0)
    print('this is the substring after the first ">":', substrings[1])
    # and let's split again by '<', but this time let's take the first element
    print('this is the substring after ">" but before "<":', substrings[1].split('<')[0])

this is the list: ['<a href="http://twitter.com/download/iphone" rel="nofollow"', 'Twitter for iPhone</a', '']
this is the substring after the first ">": Twitter for iPhone</a
this is the substring after ">" but before "<": Twitter for iPhone
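
An alternative to the two `split` calls is a regular expression. The pattern below is an assumption about how these links are shaped (a single `<a ...>` tag wrapping plain text), so treat it as a sketch and not as a general HTML parser:

```python
import re

line = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Capture whatever sits between the opening <a ...> tag and the closing </a>.
pattern = re.compile(r'<a\s[^>]*>(.*?)</a>')

match = pattern.search(line)
# Fall back to the original string when there is no link in it.
name = match.group(1) if match else line
print(name)
```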


In [26]:
# extract the source from the string, but first let's see how many times we have a link
print('Total source entries:', len(rawdf))
flt_df = rawdf[rawdf.source.str.contains('a href')].source
print('Source entries that contain a link:', len(flt_df))

# let's take the first 1000 lines of our dataframe
# (.copy() gives an independent dataframe, so the assignment below does not raise SettingWithCopyWarning)
df = rawdf[:1000].copy()

# remove html code
def extractLinkName(source):
    if "href" in source:
        return source.split('>')[1].split('<')[0]
    else:
        return source

# we can even replace the same column with the one without the html code:
df['source'] = df['source'].apply(extractLinkName)
df.head(3)

Total source entries: 2129755
Source entries that contain a link: 1772648


Unnamed: 0,created_at,tweet_id,text,source,favorited,retweeted,possibly_sensitive,lang,geo,coordinates,hashtags,urls,user_mentions,in_reply_to_user_id,retweet_count,user_id,user_name,user_location,user_followers_count,user_friends_count,user_created_at,user_verified
0,2013-01-28T13:41:54+00:00,295889441355071488,BBC News - Twitter launches Vine micro-video sharing service http://t.co/MykDuEgx,Tweet Button,False,False,False,en,,,[],"[{'url': 'http://t.co/MykDuEgx', 'indices': [61, 81], 'display_url': 'bbc.in/148Grra', 'expanded...",[],,0,632569755,Sherry Edwards,US,8,45,2012-07-11T02:01:19+00:00,False
1,2013-01-28T13:41:57+00:00,295889452121866241,“Vine” อลวน โป๊ระบาด-เฟซบุ๊กสั่งแบน-ทวิตเตอร์เซ็ง http://t.co/mIo9PO6t,Facebook,False,False,False,th,,,[],"[{'url': 'http://t.co/mIo9PO6t', 'indices': [50, 70], 'display_url': 'manager.co.th/Cyberbiz/Vie...",[],,0,94285151,Keng KSPstudio,,165,736,2009-12-03T08:18:58+00:00,False
2,2013-01-28T13:41:58+00:00,295889454705565696,"Really excited for the explosion of 'What does Vine mean for _______?' pieces, which will dodge ...",web,False,False,,en,,,[],[],[],,0,21334099,Ryan Farkas,Toronto,367,841,2009-02-19T20:13:33+00:00,False
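
The same clean-up can also be sketched with pandas string methods instead of `apply`. The series below is a hypothetical two-element sample, and the pattern assumes the link text never contains a `<` character:

```python
import pandas as pd

# Hypothetical sample of 'source' values: one link and one plain string.
sources = pd.Series([
    '<a href="http://twitter.com/tweetbutton" rel="nofollow">Tweet Button</a>',
    "web",
])

# Capture the text between the first '>' and the next '<'.
# Values without a match (such as "web") become missing.
extracted = sources.str.extract(r'>([^<]*)<', expand=False)

# Keep the original value wherever nothing was extracted.
clean = extracted.fillna(sources)
print(clean.tolist())
```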


### What's next

We have seen a very quick introduction to Python and the Jupyter Notebook. 
Mastering this programming language and some of its libraries (such as `pandas`) gives you a powerful tool to explore, analyze and exploit many types of data. 

You should also start to investigate some of the libraries that let Python produce clear, attractive visualizations. There are many ways to make visualizations in Python, and some of these libraries are mentioned in the presentation; however, I would strongly suggest checking the following:

- [ggplot2](http://ggplot2.org/): based on the Grammar of Graphics, it is the same library you can find in R; the links below point to its Python version
    - [ggplot.yhathq.com/](http://ggplot.yhathq.com/)
    - [Yhat/ggplot-for-python](http://www.slideshare.net/Yhat/ggplot-for-python)
- [seaborn](https://stanford.edu/~mwaskom/software/seaborn/): a very simple library for drawing various plots. It does not have the power of `ggplot2`, but it is handy when you want to plot something fast that still looks good

*Enjoy*. 