# Data Analysis with Pandas (Part 1)

**Outline:**

* [Intro to Pandas](#Intro-to-Pandas)
* [Pandas Data Structures](#Pandas-Data-Structures)
  * [Series](#Series)
  * [DataFrame](#DataFrame)
* [Pandas Data Types](#Pandas-Data-Types)
* [Knowing Basic Stats](#Knowing-Basic-Stats)
* [Dealing with Files](#Dealing-with-Files)
  * [Reading Data from File](#Reading-Data-from-File)
  * [Writing Data to File](#Writing-Data-to-File)
* [Dealing with Columns](#Dealing-with-Columns)
  * [Renaming Columns](#Renaming-Columns)
  * [Adding New Columns](#Adding-New-Columns)
  * [Removing Existing Columns](#Removing-Existing-Columns)

## Intro to Pandas

In [1]:
from IPython.display import IFrame
IFrame("https://pandas.pydata.org", width=800, height=350)  # Embed the pandas home page

In [2]:
import pandas as pd

## Pandas Data Structures

### Series

In [3]:
series_data = pd.Series([113, 1463, 95, 33])
series_data

0     113
1    1463
2      95
3      33
dtype: int64

In [4]:
type(series_data)

pandas.core.series.Series

In [5]:
series_data = pd.Series({'a': 113, 'b': 1463, 'c': 95, 'd': 33})
series_data

a     113
b    1463
c      95
d      33
dtype: int64

In [6]:
series_data = pd.Series({'a': 113, 'b': 1463, 'c': 95, 'd': 33}, index=['b', 'c', 'd', 'e', 'f'])  # Keep only the listed labels; 'a' is dropped, and 'e' and 'f' (not in the dict) become NaN
series_data

b    1463.0
c      95.0
d      33.0
e       NaN
f       NaN
dtype: float64

In [7]:
series_data.isnull()

b    False
c    False
d    False
e     True
f     True
dtype: bool

In [8]:
series_data.index

Index(['b', 'c', 'd', 'e', 'f'], dtype='object')

In [9]:
series_data.values

array([ 1463.,    95.,    33.,    nan,    nan])

In [10]:
series_data + series_data

b    2926.0
c     190.0
d      66.0
e       NaN
f       NaN
dtype: float64
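
Adding a Series to itself hides the fact that arithmetic between two Series matches values by index label rather than by position. The sketch below uses two small made-up Series (`left` and `right` are illustrative names, not part of the notebook) whose labels only partly overlap.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'b': 10, 'c': 20, 'd': 30})

# Addition pairs values by label, not by position.
total = left + right

# Labels found on only one side have no partner, so the result is NaN there.
print(total)

# Series.add with fill_value=0 treats a missing partner as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```

Because NaN is a floating-point value, the result is stored as `float64` even though both inputs hold integers, which is the same effect seen in cell 6 above.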

In [None]:
pd.concat([series_data, pd.Series([113, 1463, 95, 33])])  # Concatenate two Series (Series.append was removed in pandas 2.0)

### DataFrame

In [97]:
personal_data_dict = {
    'age': [39, 50, 38, 38],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', 'Bachelors'],
    'occupation': ['Adm-clerical', 'Tech-support', 'Sales', 'Engineer'],
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'capital-gain': [2174, 111, 993, 125]
}
df = pd.DataFrame(personal_data_dict)

In [98]:
print(personal_data_dict)

{'sex': ['Male', 'Female', 'Female', 'Male'], 'age': [39, 50, 38, 38], 'occupation': ['Adm-clerical', 'Tech-support', 'Sales', 'Engineer'], 'capital-gain': [2174, 111, 993, 125], 'education': ['Bachelors', 'Bachelors', 'HS-grad', 'Bachelors']}


In [36]:
type(df)

pandas.core.frame.DataFrame

In [37]:
df.shape

(4, 5)

In [38]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [39]:
df.values

array([[39, 2174, 'Bachelors', 'Adm-clerical', 'Male'],
       [50, 111, 'Bachelors', 'Tech-support', 'Female'],
       [38, 993, 'HS-grad', 'Sales', 'Female'],
       [38, 125, 'Bachelors', 'Engineer', 'Male']], dtype=object)

In [40]:
df.columns

Index(['age', 'capital-gain', 'education', 'occupation', 'sex'], dtype='object')

In [41]:
df.head()  # Print first 5 rows

Unnamed: 0,age,capital-gain,education,occupation,sex
0,39,2174,Bachelors,Adm-clerical,Male
1,50,111,Bachelors,Tech-support,Female
2,38,993,HS-grad,Sales,Female
3,38,125,Bachelors,Engineer,Male


In [42]:
df.tail()  # Print last 5 rows

Unnamed: 0,age,capital-gain,education,occupation,sex
0,39,2174,Bachelors,Adm-clerical,Male
1,50,111,Bachelors,Tech-support,Female
2,38,993,HS-grad,Sales,Female
3,38,125,Bachelors,Engineer,Male


In [43]:
df.age.value_counts()  # Count how often each distinct age occurs

38    2
39    1
50    1
Name: age, dtype: int64

In [44]:
type(df.age)

pandas.core.series.Series

In [45]:
df["age"]  # Bracket-style column access (works for any column name)

0    39
1    50
2    38
3    38
Name: age, dtype: int64

In [46]:
df.age  # Attribute-style column access (only for names that are valid Python identifiers)

0    39
1    50
2    38
3    38
Name: age, dtype: int64

In [47]:
df.age.value_counts()  # Frequency table

38    2
39    1
50    1
Name: age, dtype: int64

In [48]:
df.age.value_counts(ascending=False)  # Frequency table in descending order of count (the default)

38    2
39    1
50    1
Name: age, dtype: int64
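
Since `ascending=False` is already the default, the call above returns the same table as a plain `value_counts()`. The following sketch, which rebuilds the four ages from `df` as a standalone Series so it can run on its own, shows a few options that do change the result.

```python
import pandas as pd

# Same four ages as in df above, rebuilt here so the snippet is self-contained.
ages = pd.Series([39, 50, 38, 38], name='age')

# Default: most frequent value first.
counts = ages.value_counts()

# ascending=True puts the rarest values first.
rarest_first = ages.value_counts(ascending=True)

# normalize=True returns relative frequencies instead of raw counts.
shares = ages.value_counts(normalize=True)

# sort_index() orders the table by the age values rather than by count.
by_age = ages.value_counts().sort_index()

print(counts, rarest_first, shares, by_age, sep='\n\n')
```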

In [34]:
type(df.age)

pandas.core.series.Series

## Pandas Data Types

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
age             4 non-null int64
capital-gain    4 non-null int64
education       4 non-null object
occupation      4 non-null object
sex             4 non-null object
dtypes: int64(2), object(3)
memory usage: 240.0+ bytes


## Knowing Basic Stats

In [49]:
df.describe()

Unnamed: 0,age,capital-gain
count,4.0,4.0
mean,41.25,850.75
std,5.85235,973.852958
min,38.0,111.0
25%,38.0,121.5
50%,38.5,559.0
75%,41.75,1288.25
max,50.0,2174.0


In [50]:
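df[['age', 'capital-gain']].cov()  # Covariance of the numeric columns (pandas 2.0+ raises an error if text columns are included)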

Unnamed: 0,age,capital-gain
age,34.25,-2517.916667
capital-gain,-2517.916667,948389.583333


In [51]:
df[['age', 'capital-gain']].corr()  # Pearson correlation of the numeric columns

Unnamed: 0,age,capital-gain
age,1.0,-0.441792
capital-gain,-0.441792,1.0


## Dealing with Files

### Reading Data from File

UCI Machine Learning Repository: [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult)

In [58]:
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)

In [59]:
adult.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [60]:
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)

In [61]:
adult.head()

Unnamed: 0,age,Work Class,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Money Per Year
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### Writing Data to File

In [63]:
adult.to_json('data/adult.json')  # Write the data as JSON (the data/ directory must already exist)
adult.to_excel('data/adult.xlsx')  # Write the data as Excel (requires an Excel writer such as openpyxl)

In [64]:
!ls data  # A leading ! sends the command to the shell

adult.json adult.xlsx


In [65]:
adult.to_csv('data/adult.csv')

In [66]:
!ls data

adult.csv  adult.json adult.xlsx


In [70]:
data = pd.read_csv("data/adult.csv", index_col = 0)  # Use column 0 as index column
data

Unnamed: 0,age,Work Class,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Money Per Year
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


## Dealing with Columns

### Renaming Columns

In [73]:
adult = pd.read_csv('data/adult.csv', index_col=0)

In [74]:
adult.head()

Unnamed: 0,age,Work Class,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Money Per Year
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [75]:
adult = adult.rename(columns={'Work Class': 'workclass'})

In [87]:
adult.workclass.head()
adult["workclass"][0]  # First value of the workclass column (row label 0)

' State-gov'
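
The leading space in `' State-gov'` is there because the fields in `adult.data` are separated by a comma followed by a space. The sketch below reproduces this with two shortened rows held in a string (only three of the fifteen fields, chosen for illustration) and shows two ways to remove the space.

```python
import io
import pandas as pd

# Two rows in the same ", "-separated layout as adult.data, cut to three fields.
raw = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"
names = ['age', 'workclass', 'fnlwgt']

# Default parsing keeps the space that follows each comma in text fields.
plain = pd.read_csv(io.StringIO(raw), names=names)
print(repr(plain['workclass'][0]))

# skipinitialspace=True drops whitespace right after the delimiter.
clean = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)
print(repr(clean['workclass'][0]))

# For a frame that is already loaded, .str.strip() removes the padding.
stripped = plain['workclass'].str.strip()
```

Stripping matters later on: a comparison such as `adult.workclass == 'State-gov'` matches nothing while the values still carry the leading space.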

In [77]:
adult.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'Money Per Year'],
      dtype='object')

In [88]:
adult.columns = adult.columns.str.lower().str.replace(' ', '-')  # Chain .str.lower() and .str.replace(): lowercase every column name and replace spaces with hyphens

In [89]:
adult.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'money-per-year'],
      dtype='object')

In [90]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
money-per-year    32561 non-null object
dtypes: int64(6), object(9)
memory usage: 4.0+ MB


### Adding New Columns

In [91]:
adult['normalized-age'] = (adult.age - adult.age.mean()) / adult.age.std()

In [92]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money-per-year,normalized-age
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0.03067
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0.837096
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,-0.042641
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,1.057031
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,-0.775756
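
The expression used for `normalized-age` is a z-score: subtract the mean, then divide by the standard deviation. As a quick sanity check, and assuming only the four ages from the small `df` example rather than the full dataset, the result should have a mean of about 0 and a standard deviation of about 1.

```python
import pandas as pd

# The four ages from the small example frame (not the full adult data).
ages = pd.Series([39, 50, 38, 38], name='age')

# Same formula as for the normalized-age column.
z = (ages - ages.mean()) / ages.std()

# Series.std() uses the sample standard deviation (ddof=1) by default,
# so the standardized values have sample standard deviation 1.
print(z.mean())
print(z.std())
```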


### Removing Existing Columns

In [93]:
adult.drop('normalized-age', axis=1)  # axis=0 refers to rows, axis=1 to columns; drop returns a new DataFrame and leaves adult unchanged

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money-per-year
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [94]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money-per-year,normalized-age
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0.03067
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0.837096
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,-0.042641
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,1.057031
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,-0.775756


In [95]:
adult = adult.drop('normalized-age', axis=1)  # Assign the result back to the variable to keep the change

In [96]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money-per-year
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
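
As the cells above show, `drop` returns a new DataFrame and the original only changes once the result is assigned back. A short sketch of two related options, using a made-up three-column frame (`frame` and the `tmp` column are hypothetical names):

```python
import pandas as pd

# A small frame with a throwaway column (illustrative values).
frame = pd.DataFrame({'age': [39, 50], 'sex': ['Male', 'Female'], 'tmp': [0.1, 0.2]})

# columns= is an alternative to passing axis=1 and reads more explicitly.
without_tmp = frame.drop(columns=['tmp'])

# The original frame still has the column because drop did not modify it.
print(list(frame.columns))
print(list(without_tmp.columns))

# errors='ignore' skips labels that do not exist instead of raising KeyError,
# which helps when a notebook cell may be run more than once.
safe = frame.drop(columns=['not-a-column'], errors='ignore')
```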
