### The data structures of pandas

1. Series
2. DataFrame 
3. Panel

### Series
A Series is a one-dimensional labeled array that can hold any type of data: integers, floats, strings, or arbitrary Python objects.

In [1]:
import pandas as pd
import numpy as np
pd.Series(np.random.randn(5))

0    1.026744
1    0.710258
2   -1.702517
3    0.359361
4    2.818192
dtype: float64

In [2]:
# The index of the series can be customized
pd.Series(np.random.randn(5), index=['a','b','c','d','e'])

a    0.068467
b    0.403448
c    0.303408
d   -0.361557
e    0.650737
dtype: float64

In [3]:
# A series can be derived from a Python dict too
d = {'A': 10, 'B': 20, 'C': 30}
d

{'A': 10, 'B': 20, 'C': 30}

In [4]:
pd.Series(d)

A    10
B    20
C    30
dtype: int64
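
Because a Series carries labels, values can be looked up by label as well as by position, and arithmetic between two Series lines up on those labels. The snippet below is a minimal sketch of that behaviour; the values and the second Series `t` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

# Label-based and position-based access
by_label = s['B']        # value stored under label 'B'
by_position = s.iloc[0]  # first value, regardless of its label

# Arithmetic aligns on labels; a label missing from either side gives NaN
t = pd.Series([1, 2], index=['A', 'B'])
total = s + t
print(total)
```

Here `total` has no usable value for `'C'`, since `t` does not contain that label, and the result is promoted to a float dtype so it can hold the NaN.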

### DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can each have a different data type.

A DataFrame can be formed from the following data structures:
1. A 1D NumPy array
2. Lists
3. Dicts
4. Series
5. A 2D NumPy array

In [5]:
# A DataFrame can be created from a dict of Series;
# the shorter Series is padded with NaN to match the longer one
d = {'c1': pd.Series(['A', 'B', 'C']), 'c2': pd.Series([1., 2., 3., 4.])}
df = pd.DataFrame(d)
df

|   | c1  | c2  |
|---|-----|-----|
| 0 | A   | 1.0 |
| 1 | B   | 2.0 |
| 2 | C   | 3.0 |
| 3 | NaN | 4.0 |


In [6]:
# A DataFrame can be created from a dict of lists
d = {'c1': ['A','B','C','D'], 'c2': [1, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d)
df

|   | c1 | c2  |
|---|----|-----|
| 0 | A  | 1.0 |
| 1 | B  | 2.0 |
| 2 | C  | 3.0 |
| 3 | D  | 4.0 |
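
The cells above cover dicts of Series and dicts of lists. As a rough sketch of two of the other inputs from the list, the following builds a DataFrame from a 2D NumPy array and from a list of dicts; the names `arr`, `df_arr`, `rows` and `df_rows` are only illustrative.

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array: rows and columns get default integer labels
# unless index/columns are passed explicitly
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, columns=['c1', 'c2'])

# From a list of dicts: each dict is one row, keys become column names,
# and keys missing from a row are filled with NaN
rows = [{'c1': 'A', 'c2': 1.0}, {'c1': 'B'}]
df_rows = pd.DataFrame(rows)

print(df_arr)
print(df_rows)
```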


### Panel
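A Panel is a data structure that handles 3D data. It is deprecated, as the warning in the output below shows, and was removed entirely in pandas 0.25, so the cells in this section only run on older versions.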

In [7]:
d = {'Item1': pd.DataFrame(np.random.randn(4,3)), 'item2': pd.DataFrame(np.random.randn(4,2))}

In [8]:
pd.Panel(d)

Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.


<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
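
On pandas versions without `Panel`, the same kind of 3D data can be held in a DataFrame with a MultiIndex, as the deprecation warning suggests. The following is one possible sketch using `pd.concat`; the level names `'item'` and `'row'` are assumptions made for this example, not something the original code defines.

```python
import numpy as np
import pandas as pd

frames = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2)),
}

# Stack the frames vertically; the dict keys become the outer index level.
# The level names 'item' and 'row' are illustrative choices.
stacked = pd.concat(frames, names=['item', 'row'])

# Select one "slice" of the 3D data by its outer label
item1 = stacked.loc['Item1']
print(stacked)
```

Since `Item2` has only two columns, its third column in `stacked` is filled with NaN.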

### Importing and exporting data

In [12]:
d = pd.read_csv('data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
d[0:5]['AREA NAME']

0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
Name: AREA NAME, dtype: object

In [15]:
d = {'c1': pd.Series(['A','B','C']), 'c2': pd.Series([1,2,3,4])}
df = pd.DataFrame(d)
df.to_csv('sample_data.csv')
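
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. A small round-trip sketch, using an in-memory buffer rather than a file on disk so it runs anywhere:

```python
import io
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C'], 'c2': [1, 2, 3]})

# Write to an in-memory buffer instead of a file; index=False omits the
# row labels, so no extra unnamed column appears when reading back
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a file path instead of `buffer` works the same way for both calls.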

### Checking for missing data

In [29]:
df = pd.read_csv('data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
df['Location 1'].isnull()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
3240    False
3241    False
3242    False
3243    False
3244    False
3245    False
3246    False
3247    False
3248    False
3249    False
3250    False
3251    False
3252    False
3253    False
3254    False
3255    False
3256    False
3257    False
3258    False
3259    False
3260    False
3261    False
3262    False
3263    False
3264    False
3265    False
3266    False
3267     True
3268     True
3269     True
Name: Location 1, Length: 3270, dtype: bool

In [30]:
df['Location 1'].isnull().value_counts()

False    3246
True       24
Name: Location 1, dtype: int64
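
The same check can be applied to a whole DataFrame at once. The sketch below uses a small made-up frame (not the student weight dataset) to count nulls per column and to pull out the rows that contain any null.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
df_demo = pd.DataFrame({
    'name': ['A', 'B', None, 'D'],
    'score': [1.0, np.nan, np.nan, 4.0],
})

# isnull() gives a boolean frame; summing counts the True values per column
missing_per_column = df_demo.isnull().sum()

# Keep only the rows that have at least one null value
rows_with_nulls = df_demo[df_demo.isnull().any(axis=1)]
print(missing_per_column)
```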

In [32]:
# Drop the entries whose value is missing
df['Location 1'].dropna()


0       15 MOUNTAIN RD\nRAVENA, NY 12143\n(42.47227638...
1       15 MOUNTAIN RD\nRAVENA, NY 12143\n(42.47227638...
2       15 MOUNTAIN RD\nRAVENA, NY 12143\n(42.47227638...
3       7 BEVAN ST\nCOHOES, NY 12047\n(42.771285452000...
4       7 BEVAN ST\nCOHOES, NY 12047\n(42.771285452000...
5       7 BEVAN ST\nCOHOES, NY 12047\n(42.771285452000...
6       102 LORALEE DR\nALBANY, NY 12205\n(42.73352407...
7       102 LORALEE DR\nALBANY, NY 12205\n(42.73352407...
8       102 LORALEE DR\nALBANY, NY 12205\n(42.73352407...
9       91 FIDDLERS LN\nLATHAM, NY 12110\n(42.72935391...
10      91 FIDDLERS LN\nLATHAM, NY 12110\n(42.72935391...
11      91 FIDDLERS LN\nLATHAM, NY 12110\n(42.72935391...
12      171 HUDSON AVE\nGREEN ISLAND, NY 12183\n(42.74...
13      171 HUDSON AVE\nGREEN ISLAND, NY 12183\n(42.74...
14      171 HUDSON AVE\nGREEN ISLAND, NY 12183\n(42.74...
15      8 SCHOOL RD\nGUILDERLAND CENTER, NY 12085\n(42...
16      8 SCHOOL RD\nGUILDERLAND CENTER, NY 12085\n(42...
17      8 SCHO
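
`dropna` also works on a whole DataFrame, where it can drop every row that contains a null or only look at selected columns. A minimal sketch on made-up data (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in two different columns
df_demo = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'location': ['x', None, 'z'],
    'score': [1.0, 2.0, np.nan],
})

# Drop rows that have a null in any column
any_null_dropped = df_demo.dropna()

# Drop rows only when 'location' is null; other columns are ignored
location_required = df_demo.dropna(subset=['location'])
print(location_required)
```

Both calls return a new DataFrame and leave `df_demo` unchanged.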

### Filling the missing data

In [34]:
df = pd.DataFrame(np.random.randn(5,3), index=['a0','a10','a20','a30','a40'], columns=['X','Y','Z'])
df

|     | X         | Y         | Z         |
|-----|-----------|-----------|-----------|
| a0  | 1.365116  | -1.74736  | -0.483754 |
| a10 | -1.153489 | -0.682604 | -2.44587  |
| a20 | -0.103668 | 0.758928  | 0.305022  |
| a30 | -0.106463 | -0.519923 | -0.121175 |
| a40 | -1.123002 | 0.926606  | -0.393299 |


In [36]:
df2 = df.reindex(['a0','a1','a10','a11','a20','a21','a30','a31','a40','a41'])
df2

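|     | X         | Y         | Z         |
|-----|-----------|-----------|-----------|
| a0  | 1.365116  | -1.74736  | -0.483754 |
| a1  | NaN       | NaN       | NaN       |
| a10 | -1.153489 | -0.682604 | -2.44587  |
| a11 | NaN       | NaN       | NaN       |
| a20 | -0.103668 | 0.758928  | 0.305022  |
| a21 | NaN       | NaN       | NaN       |
| a30 | -0.106463 | -0.519923 | -0.121175 |
| a31 | NaN       | NaN       | NaN       |
| a40 | -1.123002 | 0.926606  | -0.393299 |
| a41 | NaN       | NaN       | NaN       |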


In [41]:
# Replace null values with zero
df2.fillna(0)

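|     | X         | Y         | Z         |
|-----|-----------|-----------|-----------|
| a0  | 1.365116  | -1.74736  | -0.483754 |
| a1  | 0.0       | 0.0       | 0.0       |
| a10 | -1.153489 | -0.682604 | -2.44587  |
| a11 | 0.0       | 0.0       | 0.0       |
| a20 | -0.103668 | 0.758928  | 0.305022  |
| a21 | 0.0       | 0.0       | 0.0       |
| a30 | -0.106463 | -0.519923 | -0.121175 |
| a31 | 0.0       | 0.0       | 0.0       |
| a40 | -1.123002 | 0.926606  | -0.393299 |
| a41 | 0.0       | 0.0       | 0.0       |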


In [43]:
# Fill with forward propagation: each null value is replaced by the last
# non-null value above it in the same column ('pad' is an alias for 'ffill')
df2.fillna(method='pad')

|     | X         | Y         | Z         |
|-----|-----------|-----------|-----------|
| a0  | 1.365116  | -1.74736  | -0.483754 |
| a1  | 1.365116  | -1.74736  | -0.483754 |
| a10 | -1.153489 | -0.682604 | -2.44587  |
| a11 | -1.153489 | -0.682604 | -2.44587  |
| a20 | -0.103668 | 0.758928  | 0.305022  |
| a21 | -0.103668 | 0.758928  | 0.305022  |
| a30 | -0.106463 | -0.519923 | -0.121175 |
| a31 | -0.106463 | -0.519923 | -0.121175 |
| a40 | -1.123002 | 0.926606  | -0.393299 |
| a41 | -1.123002 | 0.926606  | -0.393299 |


In [44]:
# fill with column mean
df2.fillna(df2.mean())

|     | X         | Y         | Z         |
|-----|-----------|-----------|-----------|
| a0  | 1.365116  | -1.74736  | -0.483754 |
| a1  | -0.224301 | -0.25287  | -0.627815 |
| a10 | -1.153489 | -0.682604 | -2.44587  |
| a11 | -0.224301 | -0.25287  | -0.627815 |
| a20 | -0.103668 | 0.758928  | 0.305022  |
| a21 | -0.224301 | -0.25287  | -0.627815 |
| a30 | -0.106463 | -0.519923 | -0.121175 |
| a31 | -0.224301 | -0.25287  | -0.627815 |
| a40 | -1.123002 | 0.926606  | -0.393299 |
| a41 | -0.224301 | -0.25287  | -0.627815 |
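
A few other fill strategies follow the same pattern. The sketch below uses a small made-up frame to show the dedicated `ffill` and `bfill` methods and a per-column fill value passed as a dict; recent pandas versions prefer `ffill()` over `fillna(method='pad')`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps at different positions
df_demo = pd.DataFrame({
    'X': [1.0, np.nan, 3.0],
    'Y': [np.nan, 5.0, np.nan],
})

# Forward fill: same effect as fillna(method='pad')
forward = df_demo.ffill()

# Backward fill: use the next non-null value in the column instead
backward = df_demo.bfill()

# A dict gives each column its own constant fill value
per_column = df_demo.fillna({'X': 0.0, 'Y': -1.0})
print(per_column)
```

Note that a forward fill cannot fill a gap at the top of a column and a backward fill cannot fill one at the bottom, so `forward` and `backward` each still contain one NaN in `Y`.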
