# Pandas basic commands - A quick look

## This notebook is a quick reference for some commonly used pandas commands in day-to-day Data Science practice. It covers basic commands only.

#### Importing the numpy and pandas libraries

In [2]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame


#### Importing data copied to clipboard

In [5]:
ipl_data = pd.read_clipboard()
ipl_data

Unnamed: 0,Value,Entity,Unnamed: 3,Notes
0,0.8902,Golden State Warriors,2015-16,73-9
1,0.878,Chicago Bulls,1995-96,72-10
2,0.8415,Los Angeles Lakers,1971-72,69-13
3,0.8415,Chicago Bulls,1996-97,69-13


#### Importing the data from CSV File

In [6]:
california_housing = pd.read_csv("cal-house.csv")
california_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.60,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


#### Retrieve first few rows from the table

In [7]:
california_housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


#### Retrieve statistical information of the data

In [8]:
california_housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


The describe function computes and displays the count of values, their mean, and their standard deviation for each numeric column. It also calculates the five-number summary of the data: the minimum value, the 25th percentile, the median (50th percentile), the 75th percentile, and the maximum value. This makes describe a handy way to check the basic statistics of a dataset quickly before making modelling decisions.
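The result of describe is itself a dataframe, so individual statistics can be looked up by label. The sketch below assumes a small made-up table (the column names `rooms` and `price` are hypothetical and not taken from cal-house.csv):

```python
import pandas as pd

# Hypothetical toy data, used only to keep the numbers easy to check by hand
toy = pd.DataFrame({
    "rooms": [2, 4, 6, 8, 10],
    "price": [100.0, 150.0, 200.0, 250.0, 300.0],
})

summary = toy.describe()

# Rows of the summary are the statistics, columns are the original columns
print(summary.loc[["min", "25%", "50%", "75%", "max"]])

# The "50%" row is the median of each column
median_price = summary.loc["50%", "price"]
print(median_price == toy["price"].median())
```

Because the summary is indexed by statistic name, `summary.loc["mean"]` or `summary.loc["std"]` returns that statistic for every numeric column at once.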


### Concepts of Series and Dataframes

#### Series:
A series is a one-dimensional, indexed array of values, usually of a single data type. The values typically represent one measure recorded for different entities, and each value is labelled by its index.

#### Dataframe:
A dataframe is a collection of series, possibly of different data types, placed side by side in a tabular format and sharing one index. For example, a simple table holding the marks of every student in a class, with one column per subject, can be represented as a dataframe. A dataframe is a two-dimensional, indexed structure with labelled rows and columns.
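To make the relationship concrete, here is a minimal sketch that builds a dataframe from two series. The student names and marks are invented for illustration:

```python
import pandas as pd

# Hypothetical marks: one series per subject, indexed by student name
maths = pd.Series([78, 91, 65], index=["Asha", "Ben", "Chen"])
physics = pd.Series([82.5, 74.0, 88.0], index=["Asha", "Ben", "Chen"])

# Each series becomes one column; the shared index becomes the row labels
marks = pd.DataFrame({"maths": maths, "physics": physics})

# Selecting a single column gives back a series
column = marks["maths"]
print(type(column).__name__)

# Columns may have different dtypes, and the table has two dimensions
print(marks.dtypes)
print(marks.shape)
```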


#### Creating a Series in pandas

In [10]:
series1 = Series(np.array(['India','USA','China','France','Russia']))
series1

0     India
1       USA
2     China
3    France
4    Russia
dtype: object

Notice that when the array is converted to a series using the Series function of the pandas library, each value is assigned an index label, by default the integers starting from 0.

#### Converting a list to a Series in pandas

In [12]:
list1 = ["A","B","C","D","E"]
series1 = Series(list1)
series1

0    A
1    B
2    C
3    D
4    E
dtype: object

#### Creating a Dataframe in pandas

In [13]:
df1 = DataFrame([[1,2,3,4,5],[6,7,8,9,0],[0,9,8,7,6],[5,4,3,2,1]])
df1

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,6,7,8,9,0
2,0,9,8,7,6
3,5,4,3,2,1


#### Creating a Dataframe from a one-dimensional array reshaped into a matrix

In [15]:
array1 = np.arange(25)
array1 = array1.reshape((5,5))
df1 = DataFrame(array1)
df1

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19
4,20,21,22,23,24


#### Customizing indices in a Series/Dataframe

In [16]:
series1 = Series(np.array(['India','USA','China','France','Russia']),index=['A','B','C','D','E'])
df1 = DataFrame(array1,index=['A','B','C','D','E'])
series1

A     India
B       USA
C     China
D    France
E    Russia
dtype: object

In [17]:
df1

Unnamed: 0,0,1,2,3,4
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19
E,20,21,22,23,24


#### Customizing column names in a Dataframe

In [18]:
df1 = DataFrame(array1,index=['A','B','C','D','E'],columns=['a','b','c','d','e'])
df1

Unnamed: 0,a,b,c,d,e
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19
E,20,21,22,23,24


#### Re-Indexing in Series/Dataframes

In [20]:
series1.reindex(index=['z','y','x','w','v'])

z    NaN
y    NaN
x    NaN
w    NaN
v    NaN
dtype: object

Note that none of the new labels exist in the original index, so every value is introduced as null (NaN). To use a custom default value instead, pass the fill_value parameter as shown below:

In [22]:
series1.reindex(index=['z','y','x','w'],fill_value=23)

z    23
y    23
x    23
w    23
dtype: object

In [23]:
df1 = df1.reindex(index=['A','B','C','D','E','F'],columns=['a','b','c','d','e','f'], fill_value=0)
df1

Unnamed: 0,a,b,c,d,e,f
A,0,1,2,3,4,0
B,5,6,7,8,9,0
C,10,11,12,13,14,0
D,15,16,17,18,19,0
E,20,21,22,23,24,0
F,0,0,0,0,0,0
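In the series examples above, none of the requested labels existed in the original index, so every value came back missing or filled. The following sketch, using a shortened hypothetical series, shows what happens when only some labels are new:

```python
import pandas as pd

countries = pd.Series(["India", "USA", "China"], index=["A", "B", "C"])

# "C" and "A" exist in the original index, "Z" does not
partial = countries.reindex(["C", "A", "Z"])
print(partial)

# Existing labels keep their values; only the new label gets the fill value
filled = countries.reindex(["C", "A", "Z"], fill_value="unknown")
print(filled)
```

Existing labels are matched by name and reordered, and only labels that are genuinely new receive NaN or the fill value. The original series is left unchanged because reindex returns a new object.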


#### Hierarchical Indexing in Series/Dataframes

In [27]:
series1 = Series([1,2,3,4,5,6,7,8,9],index=[[1,1,1,2,2,2,3,3,3],['a','b','c','a','b','c','a','b','c']])
series1

1  a    1
   b    2
   c    3
2  a    4
   b    5
   c    6
3  a    7
   b    8
   c    9
dtype: int64

In [31]:
df1 = DataFrame(np.arange(16).reshape((4,4)),index=[[1,1,2,2],['a','b','a','b']],columns=[["Alpha","Alpha","Beta","Beta"],["alpha","beta","alpha","beta"]])
df1

Unnamed: 0_level_0,Unnamed: 1_level_0,Alpha,Alpha,Beta,Beta
Unnamed: 0_level_1,Unnamed: 1_level_1,alpha,beta,alpha,beta
1,a,0,1,2,3
1,b,4,5,6,7
2,a,8,9,10,11
2,b,12,13,14,15
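Once a hierarchical index exists, values can be selected by outer label, by a full (outer, inner) pair, or by inner label alone. A minimal sketch, rebuilding the same series as above under the hypothetical name `s`:

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=[[1, 1, 1, 2, 2, 2, 3, 3, 3],
           ["a", "b", "c", "a", "b", "c", "a", "b", "c"]],
)

# All rows under outer label 2, indexed by the inner labels a, b, c
outer = s.loc[2]
print(outer)

# A single value addressed by the (outer, inner) pair
single = s.loc[(2, "b")]
print(single)

# All rows with inner label "a", one per outer label
inner = s.xs("a", level=1)
print(inner)

# unstack moves the inner level into columns, giving a 3x3 dataframe
table = s.unstack()
print(table)
```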


#### Retrieving a row from the dataframe

In [67]:
california_housing.iloc[5]

longitude              -114.5800
latitude                 33.6300
housing_median_age       29.0000
total_rooms            1387.0000
total_bedrooms          236.0000
population              671.0000
households              239.0000
median_income             3.3438
median_house_value    74000.0000
Name: 5, dtype: float64
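iloc selects by integer position, while loc selects by index label. The two coincide in the example above only because the housing data uses the default 0, 1, 2, ... index. A small sketch with a hypothetical labelled dataframe shows the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 5, 10], "b": [1, 6, 11]}, index=["A", "B", "C"])

# Second row by position versus the row labelled "B": the same row here
by_position = df.iloc[1]
by_label = df.loc["B"]
print(by_position.equals(by_label))

# Position slices exclude the end point, label slices include it
first_two = df.iloc[:2]
up_to_b = df.loc["A":"B"]
print(first_two.equals(up_to_b))
```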

#### Joining two Series

In [38]:
ser1 = [1,2,3,4,5]
ser2 = [6,7,8,9,0]
ser1 = ser1+ser2
ser1

[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
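Note that ser1 and ser2 in the cell above are plain Python lists, for which + means concatenation. For actual pandas series, + adds values element-wise after aligning on the index, so joining end to end is usually done with pd.concat instead. A minimal sketch, using the hypothetical names `ser_a` and `ser_b`:

```python
import pandas as pd

ser_a = pd.Series([1, 2, 3, 4, 5])
ser_b = pd.Series([6, 7, 8, 9, 0])

# On series, + is element-wise addition aligned on the index
added = ser_a + ser_b
print(added)

# pd.concat joins the series end to end and keeps the original labels
joined = pd.concat([ser_a, ser_b])
print(joined)

# ignore_index=True renumbers the result from 0 to 9
renumbered = pd.concat([ser_a, ser_b], ignore_index=True)
print(renumbered)
```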

#### Dropping values from Series/Dataframes

In [28]:
seriesA = Series(np.array([1,2,3,4,5]),index=['a','b','c','d','e'])
seriesA

a    1
b    2
c    3
d    4
e    5
dtype: int32

In [29]:
temp = seriesA.drop('c')
temp

a    1
b    2
d    4
e    5
dtype: int32

#### Dropping columns in a Dataframe

In [38]:
df = pd.read_clipboard()
df

Unnamed: 0,Value,Entity,Unnamed: 2
0,73,Golden State Warriors,2015-16
1,72,Chicago Bulls,1995-96
2,69,Los Angeles Lakers,1971-72
3,69,Chicago Bulls,1996-97


In [39]:
df1 = df.drop('Value',axis=1)
df1

Unnamed: 0,Entity,Unnamed: 2
0,Golden State Warriors,2015-16
1,Chicago Bulls,1995-96
2,Los Angeles Lakers,1971-72
3,Chicago Bulls,1996-97
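drop can also take the `columns=` or `index=` keyword in place of the axis argument, and it returns a new dataframe rather than modifying the original. The sketch below assumes a small table modelled on the one above, with hypothetical column names `Wins`, `Team` and `Season`:

```python
import pandas as pd

teams = pd.DataFrame({
    "Wins": [73, 72, 69],
    "Team": ["Golden State Warriors", "Chicago Bulls", "Los Angeles Lakers"],
    "Season": ["2015-16", "1995-96", "1971-72"],
})

# Equivalent to teams.drop("Wins", axis=1)
no_wins = teams.drop(columns=["Wins"])
print(no_wins)

# Drop a row by its index label
no_first_row = teams.drop(index=[0])
print(no_first_row)

# The original dataframe is unchanged
print(teams.shape)
```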


#### Sorting and Ranking in pandas

In [46]:
seriesA = Series(np.random.randn(6))
seriesA

0   -1.618595
1   -1.035864
2    1.142488
3    1.555067
4   -0.526914
5    1.068756
dtype: float64

In [45]:
seriesA.rank()

0    1.0
1    2.0
2    5.0
3    6.0
4    3.0
5    4.0
dtype: float64
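rank assigns 1 to the smallest value by default and returns floats because tied values share the average of their ranks. A short sketch with a hypothetical series containing a tie:

```python
import pandas as pd

scores = pd.Series([10.0, 30.0, 20.0, 30.0])

# Default: ascending ranks, ties get the average rank (3 and 4 become 3.5)
ascending_rank = scores.rank()
print(ascending_rank)

# Rank 1 goes to the largest value instead
descending_rank = scores.rank(ascending=False)
print(descending_rank)

# method="min" gives every tied value the lowest rank of its group
min_rank = scores.rank(method="min")
print(min_rank)
```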

#### Sorting values by index

In [68]:
indexs = [5,3,2,1,0,4]
ss = seriesA.reindex(index=indexs)
ss

5    1.068756
3    1.555067
2    1.142488
1   -1.035864
0   -1.618595
4   -0.526914
dtype: float64

In [53]:
ss.sort_index()

0   -1.618595
1   -1.035864
2    1.142488
3    1.555067
4   -0.526914
5    1.068756
dtype: float64

#### Sorting series by values

In [55]:
seriesA.sort_values()

0   -1.618595
1   -1.035864
4   -0.526914
5    1.068756
2    1.142488
3    1.555067
dtype: float64
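sort_values sorts in ascending order by default and returns a new object. The sketch below, on hypothetical data, shows descending order and sorting a dataframe by one of its columns:

```python
import pandas as pd

values = pd.Series([0.5, -1.2, 2.0], index=["x", "y", "z"])

# Largest value first; the index labels travel with their values
descending = values.sort_values(ascending=False)
print(descending)

# For a dataframe, name the column to sort by
table = pd.DataFrame({"team": ["B", "A", "C"], "wins": [72, 73, 69]})
by_wins = table.sort_values(by="wins", ascending=False)
print(by_wins)

# The original series keeps its order
print(values)
```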

### Handling Missing values in Series/Dataframes

#### Checking whether data values are null

In [58]:
npn = np.nan
df = DataFrame([[0,npn,1,2,npn],[3,88,23,0,67],[npn,npn,npn,89,npn],[55,44,33,54,7878],[9,8,7,npn,32]])
df

Unnamed: 0,0,1,2,3,4
0,0.0,,1.0,2.0,
1,3.0,88.0,23.0,0.0,67.0
2,,,,89.0,
3,55.0,44.0,33.0,54.0,7878.0
4,9.0,8.0,7.0,,32.0


In [59]:
df.isnull()

Unnamed: 0,0,1,2,3,4
0,False,True,False,False,True
1,False,False,False,False,False
2,True,True,True,False,True
3,False,False,False,False,False
4,False,False,False,True,False
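Because True counts as 1 when summed, the boolean table from isnull can be turned into counts of missing values. A minimal sketch on a smaller hypothetical dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [0, np.nan, 1],
    [3, 88, 23],
    [np.nan, np.nan, np.nan],
])

# Number of missing values in each column
per_column = df.isnull().sum()
print(per_column)

# Number of missing values in each row
per_row = df.isnull().sum(axis=1)
print(per_row)

# Total number of missing values in the whole dataframe
total = int(df.isnull().sum().sum())
print(total)
```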


#### Drop rows where data is missing

In [60]:
df.dropna()

Unnamed: 0,0,1,2,3,4
1,3.0,88.0,23.0,0.0,67.0
3,55.0,44.0,33.0,54.0,7878.0


The dropna function drops every row of the dataframe that contains at least one null value. To drop a row only when all of its values are null, use the "how" parameter:

In [63]:
npn = np.nan
df = DataFrame([[npn,npn,npn,npn,npn],[3,88,23,0,67],[npn,npn,npn,npn,npn],[55,44,33,54,7878],[9,8,7,npn,32]])
df

Unnamed: 0,0,1,2,3,4
0,,,,,
1,3.0,88.0,23.0,0.0,67.0
2,,,,,
3,55.0,44.0,33.0,54.0,7878.0
4,9.0,8.0,7.0,,32.0


In [64]:
df.dropna(how="all")

Unnamed: 0,0,1,2,3,4
1,3.0,88.0,23.0,0.0,67.0
3,55.0,44.0,33.0,54.0,7878.0
4,9.0,8.0,7.0,,32.0


#### Drop columns where values are null

In [65]:
df

Unnamed: 0,0,1,2,3,4
0,,,,,
1,3.0,88.0,23.0,0.0,67.0
2,,,,,
3,55.0,44.0,33.0,54.0,7878.0
4,9.0,8.0,7.0,,32.0


In [66]:
df.dropna(axis=1)

0
1
2
3
4


Note that since every column contains at least one null value, all the columns are dropped, resulting in an empty dataframe.
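Dropping every column that has any null value is often too aggressive. As a sketch of two gentler options, the "how" and "thresh" parameters of dropna also work along axis=1. The small dataframe below is hypothetical:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    [nan, nan, nan],
    [3, nan, 23],
    [9, nan, nan],
])
# Column 0 has two non-null values, column 1 has none, column 2 has one

# Drop only the columns in which every value is null
only_empty_removed = df.dropna(axis=1, how="all")
print(only_empty_removed)

# Keep only the columns with at least two non-null values
at_least_two = df.dropna(axis=1, thresh=2)
print(at_least_two)

# The default still removes every column here, as in the cell above
print(df.dropna(axis=1).shape)
```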


## Thanks for Reading.