## Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of only by a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

In [3]:
import numpy as np
import pandas as pd

#### 1. Creating Series using Python Dictionary

In [4]:
# Avoid naming the variable `dict`, which would shadow the built-in type
scores = {"India": 100, "Japan": 90, "China": 80}
pd.Series(scores)

India    100
Japan     90
China     80
dtype: int64

#### 2. From a Python list

In [10]:
my_data = [10,20,30]
pd.Series(data=my_data)

0    10
1    20
2    30
dtype: int64

In [6]:
labels = ["India","Japan","China"]
my_data = [10,20,30]
pd.Series(my_data,labels)

India    10
Japan    20
China    30
dtype: int64

#### 3. From a NumPy array 

In [11]:
labels = ["India","Japan","China"]
my_data = [10,20,30]
arr = np.array(my_data)
pd.Series(arr,labels)

India    10
Japan    20
China    30
dtype: int32
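As a quick check of the three construction routes above, the sketch below builds the same Series from a dictionary, a list, and a NumPy array and compares the results. The variable names (`from_dict`, `from_list`, `from_array`) are chosen here for illustration only. The integer dtype of the array-based version can depend on the platform and NumPy version, so the sketch compares labels and values rather than dtypes.

```python
import numpy as np
import pandas as pd

labels = ["India", "Japan", "China"]
values = [10, 20, 30]

# Dictionary keys become the index labels
from_dict = pd.Series(dict(zip(labels, values)))

# A list (or array) needs the labels passed separately via `index`
from_list = pd.Series(values, index=labels)
from_array = pd.Series(np.array(values), index=labels)

print(from_dict.tolist() == from_list.tolist() == from_array.tolist())
print(list(from_dict.index) == list(from_list.index) == list(from_array.index))
```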

### A pandas Series can hold a variety of object types:

In [None]:
# Series can also hold functions
pd.Series(data=[sum,print,len])
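When a Series holds non-numeric Python objects, pandas stores them with the generic `object` dtype. A small sketch (the variable names `funcs` and `mixed` are illustrative):

```python
import pandas as pd

# A Series of built-in functions
funcs = pd.Series([sum, print, len])

# A Series mixing an int, a string, and a float
mixed = pd.Series([1, "two", 3.0])

print(funcs.dtype)             # generic object dtype
print(mixed.dtype)             # also object, because the types differ
print(funcs.iloc[2]("abc"))    # the stored `len` is still callable
```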

### Using an Index

The key to using a Series is understanding its index. pandas uses these index labels (names or numbers) for fast lookups, much like a hash table or dictionary.

In [7]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])         

In [8]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])      

In [9]:
ser1['USA']

1
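The dictionary-like behavior can be seen in a short sketch. It contrasts label-based lookup with position-based lookup, and shows that the `in` operator tests the index labels rather than the values. The variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])

by_label = ser["Germany"]          # label-based lookup
by_label_loc = ser.loc["Germany"]  # the explicit label-based accessor
by_position = ser.iloc[1]          # position-based lookup (second element)

has_usa = "USA" in ser             # membership checks the index labels
has_italy = "Italy" in ser

print(by_label, by_label_loc, by_position, has_usa, has_italy)
```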

#### Operations on Series

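Arithmetic between Series is aligned on the index labels. Labels present in only one Series produce NaN, and the integer values are converted to floats so that NaN can be represented: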

In [12]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
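If the NaN results for unmatched labels are not wanted, one option is the `Series.add` method with `fill_value`, which treats a label missing from one side as the given value. This is a sketch of one possible approach, not the only way to handle unmatched labels:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# The plain operator leaves NaN where a label exists in only one Series
total = ser1 + ser2

# add() with fill_value=0 substitutes 0 for the missing side
filled = ser1.add(ser2, fill_value=0)

print(total)
print(filled)
```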

## DataFrames

In [None]:
import numpy as np
import pandas as pd

In [13]:
from numpy.random import randn
np.random.seed(101)

# data_frame = pd.DataFrame(randn(5,4))
index = ['A','B','C','D','E']
columns = ['C1','C2','C3','C4']
data_frame = pd.DataFrame(randn(5,4), index=index, columns=columns)
data_frame

         C1        C2        C3        C4
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [None]:
np.random.seed(101)

In [None]:
randn(5,4)
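The two cells above hint at why the seed is set: resetting the seed to the same value makes `randn` produce the same numbers again, so the DataFrame can be reproduced. A minimal sketch (the names `first`, `second`, and `same` are illustrative):

```python
import numpy as np
from numpy.random import randn

np.random.seed(101)
first = randn(5, 4)

# Resetting the seed restarts the same pseudo-random sequence
np.random.seed(101)
second = randn(5, 4)

same = np.array_equal(first, second)
print(same)
```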

### Data Frame Operations

#### Selection and Indexing

Let's look at the various ways to select data from a DataFrame.

##### Accessing a Single Column

In [None]:
data_frame['C1']

In [None]:
type(data_frame)

In [None]:
# Attribute (dot) access (NOT RECOMMENDED: column names can collide with DataFrame methods and attributes)
data_frame.C1
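One reason bracket access is generally preferred over attribute access: a column whose name matches an existing DataFrame method or attribute is shadowed by it. The sketch below uses a hypothetical column named `count` to show the collision.

```python
import pandas as pd

# "count" is a hypothetical column name that clashes with DataFrame.count
df = pd.DataFrame({"C1": [1, 2], "count": [10, 20]})

col_by_bracket = df["count"]  # always returns the column
attr = df.count               # returns the DataFrame.count method instead

print(type(col_by_bracket).__name__)  # Series
print(callable(attr))                 # True: this is a method, not the column
print(df.C1.tolist())                 # attribute access works when there is no clash
```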

##### Accessing Multiple Columns

In [None]:
data_frame[['C1','C2']]

#### Example: Creating a new column:

In [None]:
data_frame['new'] = data_frame['C1'] + data_frame['C2']
data_frame

#### Example: Dropping a column:

In [None]:
data_frame.drop('new',axis=1)

In [None]:
# drop() returns a new DataFrame; the original is unchanged unless inplace=True
data_frame

In [None]:
data_frame.drop("new",axis=1,inplace=True)

In [None]:
data_frame
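An alternative to `inplace=True` is to reassign the result of `drop`. The sketch below, on a small stand-in DataFrame, shows that `drop` alone leaves the original untouched and that reassignment has the same visible effect as dropping in place. The `columns=` keyword is equivalent to passing `axis=1`.

```python
import pandas as pd

df = pd.DataFrame(
    {"C1": [1, 2], "C2": [3, 4], "new": [4, 6]},
    index=["A", "B"],
)

# drop() returns a new DataFrame; `df` itself is not modified
dropped = df.drop("new", axis=1)
cols_after_plain_drop = list(df.columns)

# Reassigning the result removes the column from the name `df`
df = df.drop(columns="new")
cols_after_reassign = list(df.columns)

print(cols_after_plain_drop)
print(cols_after_reassign)
```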

#### Example: Dropping rows

In [None]:
# data_frame.drop('E',axis=0)
data_frame.drop('E',inplace=True)

In [None]:
data_frame

In [None]:
data_frame.shape

#### Example: Selecting Rows

There are two ways:

1. Select based on the row label with `.loc`
2. Or select based on position instead of label with `.iloc`

In [None]:
data_frame.loc['A']

In [None]:
data_frame.iloc[0]
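The difference between the two accessors also shows up when slicing: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, like ordinary Python slicing. A small sketch on a stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"C1": [10, 20, 30], "C2": [1, 2, 3]},
    index=["A", "B", "C"],
)

row_by_label = df.loc["A"]      # row whose label is "A"
row_by_position = df.iloc[0]    # first row, whatever its label

label_slice = df.loc["A":"B"]   # includes the end label "B"
position_slice = df.iloc[0:1]   # excludes position 1

print(len(label_slice), len(position_slice))
```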

#### Selecting a subset of rows and columns

In [None]:
data_frame

In [None]:
data_frame.loc['C','C3']

In [None]:
data_frame.loc['D',['C3','C4']]

In [None]:
data_frame.loc[['A','B'],['C1','C2']]

#### Conditional selection

In [None]:
data_frame

In [None]:
data_frame>0

In [None]:
booleanDF = data_frame>0

In [None]:
booleanDF

In [None]:
data_frame[booleanDF]
# data_frame[data_frame>0]

In [None]:
data_frame[data_frame>0]

In [None]:
data_frame['C1']

In [None]:
data_frame['C1']>0

In [None]:
data_frame[data_frame['C1']>0]
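The two kinds of mask used above behave differently. A boolean DataFrame keeps the original shape and puts NaN where the condition is False, while a boolean Series keeps only the rows where it is True. A sketch on a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"C1": [1.0, -2.0], "C2": [-0.5, 0.9]},
    index=["A", "B"],
)

# Element-wise mask: same shape, NaN where the value is not > 0
masked = df[df > 0]

# Row mask: keeps whole rows where C1 > 0
rows = df[df["C1"] > 0]

print(masked)
print(rows)
```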

In [None]:
result_df = data_frame[data_frame['C1']>0]
result_df['C1']



In [None]:
data_frame[data_frame['C1']>0][['C1','C2']]

In [None]:
# For two conditions, combine them with | (or) and & (and), wrapping each condition in parentheses:
data_frame

In [None]:
cond1 = (data_frame['C2']>0) & (data_frame['C3']>0)
cond1

In [None]:
my_cols = ['C1','C2']
data_frame[cond1][my_cols]

In [None]:
my_cols = ['C1','C2']
data_frame[(data_frame['C2']>0) & (data_frame['C3']>0) ][my_cols]

In [None]:
data_frame

In [None]:
data_frame[(data_frame['C2']>0.7) | (data_frame['C3']>0.7) ][['C1']]
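The chained form `df[condition][columns]` can also be written as a single `.loc` call that takes the row mask and the column list together. The sketch below, on a stand-in DataFrame with made-up values, checks that both forms select the same data. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `>`.

```python
import pandas as pd

df = pd.DataFrame(
    {"C1": [1.0, -2.0, 3.0], "C2": [0.5, 0.9, -0.1], "C3": [0.2, -0.4, 0.8]},
    index=["A", "B", "C"],
)

both = (df["C2"] > 0) & (df["C3"] > 0)        # True only for row A
either = (df["C2"] > 0.7) | (df["C3"] > 0.7)  # True for rows B and C

chained = df[both][["C1", "C2"]]       # filter rows, then pick columns
single = df.loc[both, ["C1", "C2"]]    # rows and columns in one step

print(chained.equals(single))
```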

## Indexes

## Handling Missing Data

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [None]:
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(thresh=2)

In [None]:
df.fillna(value='NULL IN SOURCE')

In [None]:
df['A'].fillna(value=df['A'].mean())
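To make the effect of each call above concrete, the sketch below runs them on the same DataFrame and records which rows or columns survive. `thresh=2` keeps rows with at least two non-missing values. None of these calls modify `df` itself; each returns a new object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [5, np.nan, np.nan],
                   "C": [1, 2, 3]})

complete_rows = df.dropna()          # only row 0 has no missing values
complete_cols = df.dropna(axis=1)    # only column C has no missing values
at_least_two = df.dropna(thresh=2)   # rows 0 and 1 have >= 2 non-missing values

# Fill missing values in A with the column mean (NaN is skipped: (1 + 2) / 2)
filled_a = df["A"].fillna(df["A"].mean())

print(list(complete_rows.index), list(complete_cols.columns))
print(list(at_least_two.index), filled_a.tolist())
```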

## Groupby

In [None]:
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)

In [None]:
df

In [None]:
df.groupby('Company')

In [None]:
by_comp = df.groupby("Company")

In [None]:
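# 'Person' holds strings; recent pandas versions raise an error unless the mean is restricted to numeric columns
by_comp.mean(numeric_only=True)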

In [None]:
df.groupby('Company').mean(numeric_only=True)

In [None]:
df.groupby('Company').sum()

In [None]:
by_comp.std(numeric_only=True)

In [None]:
by_comp.min()

In [None]:
by_comp.max()

In [None]:
by_comp.count()

In [None]:
by_comp.describe()

In [None]:
by_comp.describe().transpose()

In [None]:
by_comp.describe().transpose()['GOOG']
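Another way to keep non-numeric columns out of the aggregation is to select the column of interest from the grouped object before aggregating. The sketch below selects `Sales` and computes several statistics at once with `agg`. The names `sales_by_company`, `mean_sales`, and `summary` are illustrative.

```python
import pandas as pd

data = {"Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
        "Person": ["Sam", "Charlie", "Amy", "Vanessa", "Carl", "Sarah"],
        "Sales": [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Group by company, then keep only the numeric Sales column
sales_by_company = df.groupby("Company")["Sales"]

mean_sales = sales_by_company.mean()

# Several aggregations in one call; the result has one column per statistic
summary = sales_by_company.agg(["mean", "min", "max"])

print(mean_sales)
print(summary)
```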