# Data Analysis with Pandas (Part 1)

**Outline:**

* [Intro to Pandas](#Intro-to-Pandas)
* [Pandas Data Structures](#Pandas-Data-Structures)
  * [Series](#Series)
  * [DataFrame](#DataFrame)
* [Pandas Data Types](#Pandas-Data-Types)
* [Knowing Basic Stats](#Knowing-Basic-Stats)
* [Dealing with Files](#Dealing-with-Files)
  * [Reading Data from File](#Reading-Data-from-File)
  * [Writing Data to File](#Writing-Data-to-File)
* [Dealing with Columns](#Dealing-with-Columns)
  * [Renaming Columns](#Renaming-Columns)
  * [Adding New Columns](#Adding-New-Columns)
  * [Removing Existing Columns](#Removing-Existing-Columns)

## Intro to Pandas

In [1]:
from IPython.display import HTML  # IPython.core.display is the older, deprecated import path
HTML('<iframe src="http://pandas.pydata.org" width="800" height="350"></iframe>')

In [2]:
import pandas as pd

## Pandas Data Structures

### Series

In [3]:
series_data = pd.Series([113, 1463, 95, 33])
series_data

0     113
1    1463
2      95
3      33
dtype: int64

In [4]:
type(series_data)

pandas.core.series.Series

In [5]:
series_data = pd.Series({'a': 113, 'b': 1463, 'c': 95, 'd': 33})
series_data

a     113
b    1463
c      95
d      33
dtype: int64

In [6]:
series_data = pd.Series({'a': 113, 'b': 1463, 'c': 95, 'd': 33}, index=['b', 'c', 'd', 'e', 'f'])
series_data

b    1463.0
c      95.0
d      33.0
e       NaN
f       NaN
dtype: float64

In [7]:
series_data.isnull()

b    False
c    False
d    False
e     True
f     True
dtype: bool

In [8]:
series_data.index

Index(['b', 'c', 'd', 'e', 'f'], dtype='object')

In [9]:
series_data.values

array([ 1463.,    95.,    33.,    nan,    nan])

In [10]:
series_data + series_data

b    2926.0
c     190.0
d      66.0
e       NaN
f       NaN
dtype: float64
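Adding a Series to itself hides how arithmetic pairs values up. The sketch below, using two illustrative Series (`left` and `right` are made-up names and values, not part of the notebook), suggests that pandas matches entries by index label and yields `NaN` where a label exists on only one side.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values).
left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'b': 10, 'c': 20, 'd': 30})

# The + operator aligns on labels; 'a' and 'd' appear on one side only,
# so their results are NaN and the dtype becomes float64.
total = left + right

# Series.add with fill_value treats a missing label as 0 instead.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```

This is the same mechanism that produced the `NaN` entries for `e` and `f` earlier: a label without a matching value has nothing to contribute.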

In [None]:
pd.concat([series_data, pd.Series([113, 1463, 95, 33])])  # Series.append was removed in pandas 2.0

### DataFrame

The most important pandas data structure.

In [11]:
personal_data_dict = {
    'age': [39, 50, 38],
    'education': ['Bachelors', 'Bachelors', 'HS-grad'],
    'occupation': ['Adm-clerical', 'Tech-support', 'Sales'],
    'sex': ['Male', 'Female', 'Female'],
    'capital-gain': [2174, 111, 993]
}
df = pd.DataFrame(personal_data_dict)

In [12]:
type(df)

pandas.core.frame.DataFrame

In [15]:
df.shape

(3, 5)

In [16]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [17]:
df.values

array([[39, 2174, 'Bachelors', 'Adm-clerical', 'Male'],
       [50, 111, 'Bachelors', 'Tech-support', 'Female'],
       [38, 993, 'HS-grad', 'Sales', 'Female']], dtype=object)

In [18]:
df.columns

Index(['age', 'capital-gain', 'education', 'occupation', 'sex'], dtype='object')

In [19]:
df.head()  # Print first 5 rows

   age  capital-gain  education    occupation     sex
0   39          2174  Bachelors  Adm-clerical    Male
1   50           111  Bachelors  Tech-support  Female
2   38           993    HS-grad         Sales  Female

In [20]:
df.tail()  # Print last 5 rows

   age  capital-gain  education    occupation     sex
0   39          2174  Bachelors  Adm-clerical    Male
1   50           111  Bachelors  Tech-support  Female
2   38           993    HS-grad         Sales  Female

In [21]:
df.age.value_counts()  # Count how often each age occurs

39    1
50    1
38    1
Name: age, dtype: int64

In [22]:
type(df.age)

pandas.core.series.Series

In [23]:
df["age"]  # Bracket (dictionary-style) access

0    39
1    50
2    38
Name: age, dtype: int64

In [24]:
df.age  # Attribute-style access

0    39
1    50
2    38
Name: age, dtype: int64

In [26]:
df.age.value_counts()  # Frequency table: how many times each value occurs

39    1
50    1
38    1
Name: age, dtype: int64

In [30]:
df.age.value_counts(ascending=False)  # Frequency table sorted by count, descending (the default)

39    1
50    1
38    1
Name: age, dtype: int64
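The two access styles above are not always interchangeable. As a minimal sketch on a small illustrative frame (`people` is a made-up name), attribute access only works when the column name is a valid Python identifier, so a name such as `capital-gain` needs brackets.

```python
import pandas as pd

people = pd.DataFrame({
    'age': [39, 50, 38],
    'capital-gain': [2174, 111, 993],
})

# Both styles return the same Series for a simple column name.
by_bracket = people['age']
by_attribute = people.age

# 'capital-gain' contains a hyphen; people.capital-gain would be parsed as
# (people.capital) - gain, so bracket access is the only option here.
gains = people['capital-gain']

# Passing a list of names returns a DataFrame rather than a Series.
subset = people[['age', 'capital-gain']]

print(type(gains).__name__, type(subset).__name__)
```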

## Pandas Data Types

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
age             3 non-null int64
capital-gain    3 non-null int64
education       3 non-null object
occupation      3 non-null object
sex             3 non-null object
dtypes: int64(2), object(3)
memory usage: 200.0+ bytes


## Knowing Basic Stats

In [None]:
df.describe()

In [None]:
df.cov(numeric_only=True)  # Skip the text columns (parameter added in pandas 1.5; needed in 2.0+)

In [None]:
df.corr(numeric_only=True)  # Skip the text columns (parameter added in pandas 1.5; needed in 2.0+)
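The summary methods above can also be aimed at text columns or at a single pair of columns. The following is a hedged sketch on a small illustrative frame (`people` and its values are assumptions for demonstration), not a complete tour of the statistics API.

```python
import pandas as pd

people = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': ['Male', 'Female', 'Female'],
    'capital-gain': [2174, 111, 993],
})

# By default describe() summarizes numeric columns only.
numeric_summary = people.describe()

# include='all' adds count/unique/top/freq rows for the text columns.
full_summary = people.describe(include='all')

# Single-column statistics and a pairwise correlation between two Series.
mean_age = people['age'].mean()
age_gain_corr = people['age'].corr(people['capital-gain'])

print(numeric_summary)
print(full_summary)
```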

## Dealing with Files

### Reading Data from File

UCI Machine Learning Repository: [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult)

In [None]:
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

In [None]:
adult.head()

In [None]:
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)

In [None]:
adult.head()
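To try `names=` without a network connection, the sketch below parses a two-row in-memory string. It assumes the file separates fields with a comma followed by a space, as `adult.data` appears to do; the rows and the shortened column list are illustrative only. `skipinitialspace=True` is an optional `read_csv` parameter that strips that leading space from text values.

```python
import io
import pandas as pd

# Two illustrative rows in a comma-plus-space layout (assumed to mirror adult.data).
raw = "39, State-gov, Bachelors\n50, Self-emp-not-inc, Bachelors\n"
column_names = ['age', 'workclass', 'education']

# names= supplies the header, so the first data row is not consumed as column names.
plain = pd.read_csv(io.StringIO(raw), names=column_names)

# skipinitialspace=True drops the space that follows each delimiter.
trimmed = pd.read_csv(io.StringIO(raw), names=column_names, skipinitialspace=True)

print(repr(plain.loc[0, 'workclass']))
print(repr(trimmed.loc[0, 'workclass']))
```

Without the option, comparisons such as `adult.sex == 'Male'` can silently match nothing because the stored value carries a leading space.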

### Writing Data to File

In [None]:
!mkdir -p data  # to_json does not create missing directories
adult.to_json('data/adult.json')

In [None]:
!ls data

In [None]:
adult.to_csv('data/adult.csv')

In [None]:
!ls data
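By default `to_csv` also writes the row index as an unnamed first column. A minimal sketch of the round trip, using an in-memory string and an illustrative frame (`frame` is a made-up name) instead of a file on disk:

```python
import io
import pandas as pd

frame = pd.DataFrame({'age': [39, 50], 'sex': ['Male', 'Female']})

# With no path argument, to_csv returns the CSV text as a string.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# Reading the indexed text back naively produces an extra 'Unnamed: 0' column.
reread_naive = pd.read_csv(io.StringIO(with_index))

# index_col=0 tells read_csv to use that first column as the index again.
reread_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reread_naive.columns))
print(list(reread_indexed.columns))
```

Either pass `index=False` when writing or `index_col=0` when reading; mixing neither is what introduces the stray column.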

## Dealing with Columns

### Renaming Columns

In [None]:
adult = pd.read_csv('data/adult.csv', index_col=0)

In [None]:
adult.head()

In [None]:
adult = adult.rename(columns={'Work Class': 'workclass'})

In [None]:
adult.head()

In [None]:
adult.columns

In [None]:
adult.columns = adult.columns.str.lower().str.replace(' ', '-')

In [None]:
adult.columns

In [None]:
adult.info()
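The two renaming approaches above can be compared on an empty illustrative frame (`frame` is a made-up name). As a sketch: `rename` accepts either a mapping for specific columns or a function applied to every column name, and in both cases it returns a new DataFrame.

```python
import pandas as pd

# An empty frame is enough to experiment with column names.
frame = pd.DataFrame(columns=['age', 'Work Class', 'Money Per Year'])

# A dict renames only the listed columns.
renamed = frame.rename(columns={'Work Class': 'workclass'})

# A function is applied to every column name.
normalized = frame.rename(columns=lambda name: name.lower().replace(' ', '-'))

print(list(renamed.columns))
print(list(normalized.columns))
print(list(frame.columns))  # the original frame keeps its names
```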

### Adding New Columns

In [None]:
adult['normalized-age'] = (adult.age - adult.age.mean()) / adult.age.std()

In [None]:
adult.head()
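The same standardization can be checked on a tiny illustrative frame (`ages` is a made-up name): after subtracting the mean and dividing by the standard deviation, the new column should have mean 0 and standard deviation 1. The sketch also shows `assign`, which adds a column on a copy instead of modifying the frame in place.

```python
import pandas as pd

ages = pd.DataFrame({'age': [39, 50, 38]})

# Assigning to a new key adds the column in place.
ages['normalized-age'] = (ages['age'] - ages['age'].mean()) / ages['age'].std()

# assign returns a new DataFrame; the keyword must be a valid Python identifier,
# so a hyphenated name such as 'normalized-age' cannot be passed this way directly.
with_flag = ages.assign(over_40=ages['age'] > 40)

print(ages)
print(with_flag)
```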

### Removing Existing Columns

In [None]:
adult.drop('normalized-age', axis=1)  # Returns a new DataFrame; adult itself is unchanged

In [None]:
adult.head()

In [None]:
adult = adult.drop('normalized-age', axis=1)  # Reassign to keep the result

In [None]:
adult.head()
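The difference between the two `drop` calls above can be reproduced on a small illustrative frame (`frame` and its values are made up). This sketch also uses the `columns=` keyword, an equivalent spelling of `axis=1`, and `errors='ignore'`, which skips names that are not present.

```python
import pandas as pd

frame = pd.DataFrame({
    'age': [39, 50],
    'normalized-age': [-0.7, 0.7],
    'sex': ['Male', 'Female'],
})

# Equivalent to frame.drop('normalized-age', axis=1); the result is a new DataFrame.
dropped = frame.drop(columns=['normalized-age'])

# errors='ignore' avoids a KeyError when the column does not exist.
lenient = frame.drop(columns=['no-such-column'], errors='ignore')

print(list(frame.columns))    # still three columns
print(list(dropped.columns))
```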