## Pandas basics
- pandas is a Python library for data analysis.
- pandas handles one- and two-dimensional data.
- For one-dimensional data we use pandas.Series, and for two-dimensional data we use pandas.DataFrame.

- Each element in a Series has an index label. By default the index is a sequence of integers starting at 0.
- We can supply custom labels using the "index" keyword argument.

In [55]:
import pandas as pd
import numpy as np

# Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

# Create a series with 4 random numbers and an index
s = pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])
print(s)

# Create a series from a dictionary
d = {'b' : 1, 'a' : 0, 'c' : 2}
s = pd.Series(d)
print(s)

# Create a series from a dictionary with an index
# notice the NaN values for the missing keys
d = {'b' : 1, 'a' : 0, 'c' : 2}
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)


0    1.708562
1    0.915659
2    0.281823
3    0.734974
dtype: float64
a    0.780682
b    1.422925
c   -0.684985
d    0.367584
dtype: float64
b    1
a    0
c    2
dtype: int64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
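
The last two outputs differ in dtype: once a requested label has no matching key, the Series holds a NaN and its values become floats. The sketch below repeats that comparison and adds label-based and position-based lookup. The variable names `s_int` and `s_nan` are illustrative only and do not appear in the cells above.

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Every key is used, so the values stay integers
s_int = pd.Series(d)

# 'd' is not a key in the dictionary, so its value is NaN
# and the whole Series is stored as float64
s_nan = pd.Series(d, index=['b', 'c', 'd', 'a'])

print(s_int.dtype)          # int64
print(s_nan.dtype)          # float64

# .loc looks up by label, .iloc looks up by position
print(s_nan.loc['c'])       # 2.0
print(s_nan.iloc[0])        # 1.0

# isna() flags the labels that had no matching key
print(s_nan.isna().sum())   # 1
```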


## DataFrame
- DataFrames help us work with tabular data.

- Row labels are specified with the "index" keyword, and column labels with the "columns" keyword.


In [56]:
# DataFrame with random numbers and column labels
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
print(df)

          A         B         C         D
0  0.512569  0.133839  0.204247  1.774620
1 -0.455914  0.355401 -0.292929  1.380157
2 -0.402433  0.438282  0.660208 -0.103434
3 -2.082001 -0.836116 -0.307990 -1.306658
4 -1.582906  0.064342  0.266590 -1.011040
5 -0.180062  0.830712  0.371717  0.901909
6  0.442381  0.660326 -1.000940 -0.598404
7 -0.970710 -0.378469  0.063490  0.633747
8  0.517520  0.703864 -0.508730  0.654946
9  0.366143  0.668599 -1.331724  0.074485


In [57]:
# DataFrame with sequential integers and both row and column labels
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a', 'b', 'c', 'd', 'e'], columns=['A', 'B'])
print(df)

# DataFrame from a dictionary of scalar values.
# An index is required here: without one pandas raises a ValueError.
# Each scalar is repeated for every label in the index.
df = pd.DataFrame({'a': 1, 'b': 3, 'c': 4}, index=[0, 1, 2])
print(df)

   A  B
a  0  1
b  2  3
c  4  5
d  6  7
e  8  9
   a  b  c
0  1  3  4
1  1  3  4
2  1  3  4
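
To make the index requirement concrete, here is a small sketch that triggers the error and shows two ways around it. The names `scalars`, `df_broadcast`, and `df_one_row` are made up for this illustration. Wrapping the dictionary in a list is one possible alternative, not something the cell above uses.

```python
import pandas as pd

scalars = {'a': 1, 'b': 3, 'c': 4}

# Without an index pandas cannot tell how many rows to build
try:
    pd.DataFrame(scalars)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Passing an index repeats each scalar down its column
df_broadcast = pd.DataFrame(scalars, index=[0, 1, 2])
print(df_broadcast.shape)   # (3, 3)

# Wrapping the dictionary in a list gives a single-row DataFrame instead
df_one_row = pd.DataFrame([scalars])
print(df_one_row.shape)     # (1, 3)
```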


## Upcasting
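- In DataFrames, if the data is of mixed types, upcasting happens on a per-column basis: each column is converted to a single dtype that can hold all of its values.

- This is similar to a spreadsheet, where one format is applied to a whole column rather than to individual cells.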

In [58]:
# DataFrame upcasting
# Notice each column is upcast to a dtype that can hold all of its values
df = pd.DataFrame({'A' : [1.,2.],'B':[2,3], 'C':['hello',1]}, index=[0,1])
print(df, '\n')
print(df.dtypes)

     A  B      C
0  1.0  2  hello
1  2.0  3      1 

A    float64
B      int64
C     object
dtype: object
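
The per-column rule can be checked directly. The following sketch, which uses illustrative data and names (`row`, `with_gap`) that are not in the original cell, shows three cases: an integer mixed with a float, a row that cuts across columns of different dtypes, and a missing value in otherwise integer data.

```python
import pandas as pd
import numpy as np

# Column A mixes an int and a float, so it is stored as float64.
# Column B holds only ints and is left alone.
# Column C mixes a string and an int, so it falls back to object.
df = pd.DataFrame({'A': [1, 2.5], 'B': [2, 3], 'C': ['hello', 1]})
print(df.dtypes)

# A row spans columns of different dtypes, so as a Series it is object
row = df.iloc[0]
print(row.dtype)        # object

# A missing value in otherwise integer data forces float64
with_gap = pd.Series([1, np.nan, 3])
print(with_gap.dtype)   # float64
```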


## Appending data to DataFrames
- We can append data to a DataFrame using either a Series or another DataFrame.

- The append method was deprecated in pandas 1.4 and removed in pandas 2.0; use concat instead.



In [59]:
# concat data frames
df1 = pd.DataFrame(np.random.randn(4, 4))
print(df1, '\n')
df2 = pd.DataFrame(np.random.randn(3, 3))
print(df2, '\n')

# axis=0 stacks the frames vertically; columns missing from one frame are filled with NaN
df3 = pd.concat([df1, df2], axis=0)
print(df3, '\n')

# axis=1 places the frames side by side; rows missing from one frame are filled with NaN
df4 = pd.concat([df1, df2], axis=1)
print(df4, '\n')



          0         1         2         3
0  0.077604  0.384995 -0.790249  0.950822
1 -0.066026 -0.879447 -1.553207  0.208840
2 -0.127289 -1.151952  1.071695 -0.909430
3  0.193411 -1.024189  0.422127  0.099248 

          0         1         2
0  1.237672  1.677541  1.250144
1  1.286512 -0.335135  0.644909
2  0.442381  1.421472  1.706112 

          0         1         2         3
0  0.077604  0.384995 -0.790249  0.950822
1 -0.066026 -0.879447 -1.553207  0.208840
2 -0.127289 -1.151952  1.071695 -0.909430
3  0.193411 -1.024189  0.422127  0.099248
0  1.237672  1.677541  1.250144       NaN
1  1.286512 -0.335135  0.644909       NaN
2  0.442381  1.421472  1.706112       NaN 

          0         1         2         3         0         1         2
0  0.077604  0.384995 -0.790249  0.950822  1.237672  1.677541  1.250144
1 -0.066026 -0.879447 -1.553207  0.208840  1.286512 -0.335135  0.644909
2 -0.127289 -1.151952  1.071695 -0.909430  0.442381  1.421472  1.706112
3  0.193411 -1.024189  0.422127  0.099248       NaN       NaN       NaN 
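
The outputs above keep the original row labels, so label 0 appears more than once after stacking. As a minimal sketch of two common follow-ups, the example below adds a single Series as a new row and renumbers the result with `ignore_index=True`. The frame, the `new_row` Series, and the other variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# A Series whose index matches the column labels represents one row
new_row = pd.Series({'A': 5, 'B': 6})

# to_frame().T turns the Series into a one-row DataFrame;
# ignore_index=True renumbers the rows instead of repeating labels
result = pd.concat([df, new_row.to_frame().T], ignore_index=True)
print(result)

# Without ignore_index the original labels are kept, so they repeat
stacked = pd.concat([df, df])
print(stacked.index.tolist())   # [0, 1, 0, 1]
```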

## Dropping data from DataFrame
- We use the "drop" method to remove rows or columns from a DataFrame.

- drop returns a new DataFrame and leaves the original unmodified.

In [61]:
# drop data from a DataFrame
df = pd.DataFrame(np.random.randn(4, 4))
print(df, '\n')

# drop the first row
df1 = df.drop(df.index[0])
print(df1, '\n')

# drop the first and last rows
df2 = df.drop(df.index[[0,-1]])
print(df2, '\n')

# drop the first column
df3 = df.drop(df.columns[0], axis=1)
print(df3, '\n')

# drop both the first row and last column
# notice the use of the drop method twice
df4 = df.drop(df.index[0]).drop(df.columns[-1], axis=1)
print(df4, '\n')



          0         1         2         3
0  0.294781  1.147258 -1.150713  0.266510
1  0.984694 -0.809137 -0.472871 -0.406626
2  0.970109  0.160603  1.256643  0.614948
3 -0.009581 -1.389422 -1.384330 -1.337385 

          0         1         2         3
1  0.984694 -0.809137 -0.472871 -0.406626
2  0.970109  0.160603  1.256643  0.614948
3 -0.009581 -1.389422 -1.384330 -1.337385 

          0         1         2         3
1  0.984694 -0.809137 -0.472871 -0.406626
2  0.970109  0.160603  1.256643  0.614948 

          1         2         3
0  1.147258 -1.150713  0.266510
1 -0.809137 -0.472871 -0.406626
2  0.160603  1.256643  0.614948
3 -1.389422 -1.384330 -1.337385 

          0         1         2
1  0.984694 -0.809137 -0.472871
2  0.970109  0.160603  1.256643
3 -0.009581 -1.389422 -1.384330 
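
The chained call in the last example can also be written as one call. `drop` accepts `index=` and `columns=` keywords that name the axis directly. The sketch below uses a small integer frame with made-up column labels and also checks that the original is left untouched.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# index= and columns= name the axis directly, so no axis argument is needed
trimmed = df.drop(index=[0], columns=['D'])
print(trimmed)

# The original DataFrame still has every row and column
print(df.shape)        # (3, 4)
print(trimmed.shape)   # (2, 3)

# A label that does not exist raises KeyError unless errors='ignore' is passed
same = df.drop(columns=['Z'], errors='ignore')
print(same.shape)      # (3, 4)
```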



## Accessing elements from a DataFrame

In [62]:
# accessing elements from a DataFrame
df = pd.DataFrame(np.random.randn(4, 4), columns=['A', 'B', 'C', 'D'])
print(df, '\n')

# access the first row
print(df.iloc[0], '\n')

# access the first column
print(df.iloc[:,0], '\n')

# access the element in the first row and first column
print(df.iloc[0,0], '\n')

# access the first row and first two columns
print(df.iloc[0,0:2], '\n')

# access the first and third rows
print(df.iloc[[0,2]], '\n')


          A         B         C         D
0 -0.354083  1.130136  1.282740 -0.098677
1 -2.397465 -1.777443 -0.664287  0.848176
2 -1.106927 -2.609637  0.721658 -1.745914
3 -1.762365 -0.281043  0.646177 -1.298624 

A   -0.354083
B    1.130136
C    1.282740
D   -0.098677
Name: 0, dtype: float64 

0   -0.354083
1   -2.397465
2   -1.106927
3   -1.762365
Name: A, dtype: float64 

-0.3540830802400707 

A   -0.354083
B    1.130136
Name: 0, dtype: float64 

