<h1><center>Introduction to Data Science</center></h1>

<h1><center>Brief Intro to Pandas</center></h1>



# Data Wrangling For Data Science

# Pandas highlights
- A fast and efficient DataFrame object 
- Tools for reading and writing data between in-memory data structures and different formats
- Intelligent data alignment and integrated handling of missing data
- Flexible reshaping and pivoting of data sets
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Columns can be inserted into and deleted from data structures
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure
- Time-series functionality: date range generation and frequency conversion, moving-window statistics, date shifting and lagging, plus domain-specific time offsets and joining time series without losing data
- Highly optimized for performance, with critical code paths written in C


# Introduction to pandas series
- A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.

In [44]:
import pandas as pd
import numpy as np

data = np.random.randn(5)
index = ['a','b','c','d','e']
S = pd.Series(data, index)
print(S)
print(S.values)
print(S.index)

a    0.993406
b   -0.132637
c   -0.758387
d   -1.384529
e    1.893290
dtype: float64
[ 0.99340612 -0.13263738 -0.75838718 -1.3845294   1.89328971]
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


# Note 1 on series
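- If data is a dict and an index is passed, the values in data corresponding to the labels in the index are pulled out, and any label with no matching key gets NaN. Otherwise, the index is built from the keys of the dict: in insertion order in recent pandas versions (older versions sorted the keys when possible).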

In [2]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [3]:
pd.Series(d, index=['b', 'c', 'd', 'a'])


b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

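# Series are much like arrays
- A Series is a valid argument to most NumPy functions
- A Series is also like a fixed-size dict: values can be read and set by index label
- The key difference between an array and a Series is label alignment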
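
The two analogies above can be sketched on a small hand-written Series. The values, labels, and variable names here are illustrative assumptions, not part of the lecture data.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# Array-like: a NumPy ufunc accepts a Series and returns a Series with the same index
roots = np.sqrt(s_demo)

# Array-like: boolean masks select elements as they do on an ndarray
big = s_demo[s_demo > 2.0]

# Dict-like: look up by label, test membership, and use get() with a default
b_value = s_demo['b']
has_c = 'c' in s_demo
missing = s_demo.get('z', np.nan)

print(roots)
print(big)
print(b_value, has_c, missing)
```

Note that the result of `np.sqrt` keeps the labels, which a plain ndarray would not carry.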

# Vectorized operations and label alignment with Series

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
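s + s  # element-wise sum (not displayed: a cell shows only its last expression)
s * s  # element-wise product, shown below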

a    0.001187
b    0.152606
c    2.447825
d    0.951420
e    3.927580
dtype: float64

# Key difference between array and series
- Series automatically align data by label: a label present in only one operand produces NaN in the result

In [6]:
s[1:]+s[:-1]

a         NaN
b   -0.781297
c   -3.129106
d    1.950815
e         NaN
dtype: float64
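
Because the example above uses random data, the alignment rule is easier to check with fixed values. The following is a minimal sketch with made-up numbers; `s1`, `s2`, and the use of `add` with `fill_value` are illustrative additions.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Plain addition aligns on labels; labels found in only one operand give NaN
total = s1 + s2

# add() with fill_value treats a label missing from one side as 0 instead
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```

The result index is the union of both indexes, so no label is silently dropped.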

#  Series name attribute
- Series can also have a name attribute:

In [7]:
s = pd.Series(np.random.randn(5), name='something')

s.name

'something'

# The most important Pandas data object: DataFrames
- A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index.

# Data frame acceptable inputs:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
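
A short sketch of several of the input types listed above. All data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dict of lists: keys become column names
from_dict = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Dict of Series: rows are aligned on the union of the Series indexes
from_series_dict = pd.DataFrame({
    'x': pd.Series([1, 2], index=['r1', 'r2']),
    'y': pd.Series([3, 4], index=['r2', 'r3']),
})

# 2-D ndarray: column labels are supplied separately
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# A single named Series becomes a one-column DataFrame
from_series = pd.DataFrame(pd.Series([7, 8, 9], name='value'))

print(from_dict)
print(from_series_dict)
print(from_array)
print(from_series)
```

In the dict-of-Series case, labels missing from one Series show up as NaN in that column.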

In [2]:
import pandas as pd
import numpy as np
DF1 = pd.DataFrame(np.random.randint(low=0, high=10, size=(500, 5)),
                   columns=['a', 'b', 'c', 'd', 'e'])
DF1.head(5)


   a  b  c  d  e
0  9  6  5  7  6
1  8  0  1  2  7
2  3  7  1  1  8
3  4  1  9  4  1
4  3  5  1  6  4


In [47]:
DF1 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                   index = ['t1','t2','t3','t4','t5'],
                   columns=['a', 'b', 'c', 'd', 'e'])
DF1

    a  b  c  d  e
t1  7  0  5  2  8
t2  2  0  2  7  6
t3  0  1  5  7  0
t4  8  5  7  8  8
t5  8  4  3  3  0


# Column selection, addition , deletion

In [3]:
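# Addition: create column 'f' as the element-wise sum of columns 'a' and 'd'
DF1['f'] = DF1['a'] + DF1['d']

# Deletion: drop returns a new DataFrame without column 'd'; DF1 itself is unchanged
X = DF1.drop(['d'], axis=1)

# Selection: a column can be retrieved as a Series either by dict-like notation (DF1['a']) or by attribute (DF1.a)

# Show the first 10 rows of the result
X.head(10)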




   a  b  c  e   f
0  9  6  5  6  16
1  8  0  1  7  10
2  3  7  1  8   4
3  4  1  9  1   8
4  3  5  1  4   9
5  7  4  1  8  11
6  3  0  6  8   7
7  8  3  5  0  12
8  9  0  5  7  18
9  7  2  0  4  14


In [67]:
df = pd.DataFrame([('falcon', 'bird',    389.0),
                   ('parrot', 'bird',     24.0),
                   ('lion',   'mammal',   80.5),
                   ('monkey', 'mammal', np.nan)],
                  columns=('name', 'class', 'max_speed'))

x = df.pop('class')
x

0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
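
The cells above use column assignment, `drop`, and `pop`. As a hedged sketch on a made-up frame (`df_demo` is a hypothetical name), these are a few other common ways to select, add, and delete columns:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'd': [10, 20, 30]})

# Selection: dict-like and attribute access return the same Series
col_by_key = df_demo['a']
col_by_attr = df_demo.a

# Addition: assign() returns a new DataFrame and leaves df_demo untouched
with_sum = df_demo.assign(f=df_demo['a'] + df_demo['d'])

# Addition in place: insert() puts a column at a given position
df_demo.insert(0, 'id', ['x', 'y', 'z'])

# Deletion in place: del removes a column; pop() removes it and returns it
del df_demo['d']
popped = df_demo.pop('id')

print(with_sum)
print(df_demo)
print(popped)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with a DataFrame method, so the dict-like form is the safer default.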


# Selection with loc and iloc
# After pop('class'), df has two columns (name, max_speed), so max_speed is at position 1
Y = df.iloc[:, 1]
print(Y)

In [71]:
Y = df.iloc[:1,1]

print(Y)

0    389.0
Name: max_speed, dtype: float64
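
The cells above only use iloc. The sketch below contrasts it with loc on a small frame that reuses the animal speeds from the earlier example, but with the names as the row index (an assumption made here so that label-based selection is visible).

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame(
    {'max_speed': [389.0, 24.0, 80.5, np.nan]},
    index=['falcon', 'parrot', 'lion', 'monkey'],
)

# iloc is position-based; the end of a slice is excluded
first_two = animals.iloc[:2, 0]

# loc is label-based; the end of a slice is included
falcon_to_lion = animals.loc['falcon':'lion', 'max_speed']

# loc also accepts a boolean mask to select rows by condition
fast = animals.loc[animals['max_speed'] > 50, 'max_speed']

print(first_two)
print(falcon_to_lion)
print(fast)
```

The different treatment of the slice end point is a common source of off-by-one mistakes when switching between the two.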


# Function Application and Mapping
- NumPy ufuncs (element-wise array methods) also work with pandas objects

In [53]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
np.abs(frame)

               b         d         e
Utah    1.229799  0.117755  0.209891
Ohio    0.245241  0.189417  1.007792
Texas   0.099030  0.431307  1.621503
Oregon  0.932759  0.341830  0.138251


- Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

In [54]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.130769
d    0.773136
e    2.629296
dtype: float64

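- Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary for them.
- The function passed to apply need not return a scalar value; it can also return a Series with multiple values: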


In [73]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

            b         d         e
min  0.099030 -0.341830 -1.621503
max  1.229799  0.431307  1.007792
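
By default apply works column by column. The sketch below, on a small made-up frame (`scores` and `spread` are illustrative names), shows the row-wise variant with axis=1 and the equivalent built-in reductions.

```python
import pandas as pd

scores = pd.DataFrame(
    {'b': [1.0, 4.0], 'd': [2.0, 8.0], 'e': [5.0, 6.0]},
    index=['Utah', 'Ohio'],
)

def spread(x):
    """Range (max minus min) of a one-dimensional Series."""
    return x.max() - x.min()

# axis=0 (default): the function receives each column, one result per column
per_column = scores.apply(spread)

# axis=1: the function receives each row, one result per row
per_row = scores.apply(spread, axis=1)

# Built-in reductions give the same row-wise result without apply
same_per_row = scores.max(axis=1) - scores.min(axis=1)

print(per_column)
print(per_row)
```

When a built-in reduction exists it is usually the faster choice, since apply calls the Python function once per column or row.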


# Loading and Handling Time Series in Pandas


# Working with the AirPassengers data set
- The classic Box & Jenkins airline data: monthly totals of international airline passengers, 1949 to 1960.

- Source: Aarshay Jain



# First step

In [None]:
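import pandas as pd
import numpy as np
import datetime as dtime
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface; pylab is discouraged
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 6  # default figure size in inches (width, height)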

# Now, we can load the data set and look at some initial rows and data types of the columns:

In [74]:
filepath = '/Users/martin/Documents/MyLecturesSBU/Fall2018/CSE391/data/AirPassengers.csv'
data = pd.read_csv(filepath)
print(data.head(10))
print('\n Data Types:')
print(data.dtypes)

     Month  #Passengers
0  1949-01          112
1  1949-02          118
2  1949-03          132
3  1949-04          129
4  1949-05          121
5  1949-06          135
6  1949-07          148
7  1949-08          148
8  1949-09          136
9  1949-10          119

 Data Types:
Month          object
#Passengers     int64
dtype: object
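
The dtypes output shows that Month is read as a plain object (string) column. One possible next step, sketched here as an assumption rather than the lecture's own code, is to parse it into dates and use it as the index. The sketch reads the first four rows shown above from an in-memory string so that it runs without the CSV file; `csv_text`, `ts_frame`, and `ts` are hypothetical names.

```python
import io
import pandas as pd

# First rows of the data set, copied from the output above
csv_text = """Month,#Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
"""

ts_frame = pd.read_csv(io.StringIO(csv_text))

# Convert the 'YYYY-MM' strings to timestamps and make them the row index
ts_frame['Month'] = pd.to_datetime(ts_frame['Month'], format='%Y-%m')
ts_frame = ts_frame.set_index('Month')
ts = ts_frame['#Passengers']

# With a DatetimeIndex, rows can be sliced by date strings (both ends included)
feb_to_mar = ts.loc['1949-02':'1949-03']

print(ts.index)
print(feb_to_mar)
```

A datetime index is what enables the time-series features listed in the highlights, such as date-based slicing and frequency conversion.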


In [4]:
?pd.read_csv

#  Dealing with missing values

# Two types of missing data
- missing completely at random
- missing not at random

# How to deal with missing data
- dropping the affected rows or columns
- imputation
     - expert guessing
     - averaging
     - regression
     - expectation maximization


# 1.1 Drop the missing values.

In [87]:
filepath = '/Users/martin/Documents/MyLecturesSBU/Fall2018/CSE391/data/titanic_train.csv'
passengers = pd.read_csv(filepath)
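# dropna returns a new DataFrame; with inplace=False, passengers itself is not modified.
# how and thresh cannot be combined in recent pandas versions, so thresh is omitted here.
passengers.dropna(axis=0, how='any', subset=None, inplace=False)

passengers.head()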

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


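- dropna with how='any' drops every row that has at least one missing value. That is rarely a good idea here: Cabin is missing for many passengers, so a large share of the rows would be lost. The call returns a new DataFrame, which is why passengers.head() above still shows the missing entries.
- What if we instead wanted to remove every column that has missing values?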

In [88]:
passengers.dropna(axis = 1).head(5)

,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05


In [None]:
# Keep only the rows with at least 11 non-NA values:
passengers.dropna(thresh=11)
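
The dropna variants above are easier to compare on a tiny frame where the missing cells can be counted by hand. The data below is made up, with column names borrowed from the Titanic data; `small` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.93, 8.05],
})

# Keep only rows without any missing value (none qualify in this frame)
rows_complete = small.dropna()

# Keep only columns without any missing value
cols_complete = small.dropna(axis=1)

# Keep rows that have at least 2 non-NA values
at_least_two = small.dropna(thresh=2)

# Only require 'Age' to be present; other columns may still be missing
age_known = small.dropna(subset=['Age'])

print(len(rows_complete), cols_complete.columns.tolist())
print(at_least_two)
print(age_known)
```

Restricting the check with subset is often a reasonable middle ground between dropping everything incomplete and dropping nothing.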

# 1.2 Imputation (a way to replace the missing values)
- Filling the missing values with
    - the mean
    - your suggestion?

In [None]:
passengers["Age"].fillna(value=passengers["Age"].mean()).head()
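
Two possible answers to the "your suggestion?" prompt, sketched on made-up data: filling with the median, and filling with the mean of a related group. These are illustrative options, not the lecture's prescribed method, and `people` is a hypothetical frame.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Age': [20.0, 34.0, np.nan, np.nan, 40.0],
})

# Median: less sensitive to extreme values than the mean
age_median = people['Age'].fillna(people['Age'].median())

# Group-wise mean: fill each missing age with the mean age of the same 'Sex'.
# transform('mean') returns one value per row, aligned with the original index.
group_mean = people.groupby('Sex')['Age'].transform('mean')
age_by_group = people['Age'].fillna(group_mean)

print(age_median.tolist())
print(age_by_group.tolist())
```

Neither call modifies the original column; assign the result back if the filled values should be kept.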

# The apply function in Pandas
- This is one of the most powerful tools available in Pandas. apply lets you run either one of Python's built-in functions or your own custom function across a set of your data.
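
A minimal sketch of both uses on a Series, with two passenger names taken from the Titanic output shown earlier. The helper `surname` is a hypothetical function written for this illustration.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])

# Built-in function: apply len to every element
name_lengths = names.apply(len)

# Custom function: extract the surname (the text before the first comma)
def surname(full_name):
    return full_name.split(',')[0].strip()

surnames = names.apply(surname)

print(name_lengths)
print(surnames)
```

On a Series, apply calls the function once per element; on a DataFrame it calls it once per column or row.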

$x \in \mathbb{R}^N$