# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features.

- Pandas is an add-on library to Python.
- It lets us do more things with our code, specifically with dataframes.

### What is a dataframe?

Most of the time, the first thing we need to do in data analysis is to load in data.

When we bring spreadsheet-like data into Python, it is generally shaped like a rectangle (think of Microsoft Excel tables, for example). It is represented as what we call a dataframe object, which is very similar to a spreadsheet.

The rows in a dataframe are the collected observations.


![image1.PNG](attachment:image1.PNG)

In the dataframe, the columns are variables.

![variables.PNG](attachment:variables.PNG)

# Getting Started

Pandas is an add-on library to Python that provides additional capabilities for working with dataframes.

To use pandas, you usually start with this line of code:

In [2]:
import pandas as pd

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [3]:
import pandas as pd
import numpy as np

# Modules in Python

Simply put, a module is a file consisting of Python code. It can define functions, classes, and variables, and can also include runnable code. Any Python file can be referenced as a module: a file named test.py, for example, is a module whose name is test.

# The import statement

To use the functionality present in any module, you have to import it into your current program. You do this with the import keyword followed by the module name. When the interpreter comes across an import statement, it loads the module into your current program. You can then use the functions inside the module with the dot (.) operator after the module name.

# Renaming the imported module

You can rename the module you are importing, which is useful when you want to give the module a more meaningful name or when the module name is too long to type repeatedly. Use the as keyword to rename it.
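As a minimal sketch of importing and renaming, the standard-library `math` module (chosen here only for illustration; it is not used elsewhere in this notebook) can be imported under its own name and under an alias:

```python
import math          # import the module under its own name
import math as m     # import the same module under a shorter alias

# Functions are reached with the dot operator on the module (or alias) name.
root = math.sqrt(16)
same_root = m.sqrt(16)

print(root, same_root)
print(m is math)  # both names refer to the same module object
```

This is the same mechanism behind `import pandas as pd`: `pd` is simply a shorter name for the pandas module.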

# Creating a DataFrame in Various Ways

A DataFrame can be constructed from several kinds of input, such as dictionaries and lists.

### Create pandas DataFrame from list of dictionaries

In [204]:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]
sales_df = pd.DataFrame(sales)
sales_df

Unnamed: 0,account,Jan,Feb,Mar
0,Jones LLC,150,200,140
1,Alpha Co,200,210,215
2,Blue Inc,50,90,95


In [206]:
sales_df.dtypes

account    object
Jan         int64
Feb         int64
Mar         int64
dtype: object

###  Create pandas DataFrame from dictionary of lists

The dictionary keys become the column names, and each list holds the contents of one column.

In [205]:
d = {'col1': [1,2], 'col2': [3, 4]}
df = pd.DataFrame(data = d)
df

Unnamed: 0,col1,col2
0,1,3
1,2,4


`dtypes` returns the data types in the DataFrame.
The result is a Series with the data type of each column, and its index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype.

In [180]:
df.dtypes

col1    int64
col2    int64
dtype: object
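To illustrate the point about mixed types, here is a small sketch. The DataFrame `mixed` and its column names are made up for this example and do not appear elsewhere in the notebook:

```python
import pandas as pd

# Hypothetical data: one purely numeric column, one column mixing
# an integer, a string, and a float.
mixed = pd.DataFrame({'all_ints': [1, 2, 3],
                      'mixed': [1, 'two', 3.0]})

col_types = mixed.dtypes   # a Series indexed by column name
print(col_types)

# The mixed column falls back to the generic object dtype.
print(col_types['mixed'] == object)
```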

### Create pandas DataFrame from list of lists

Each inner list represents one row.

In [207]:
lst = [['tom', 25], ['krish', 30], 
       ['nick', 26], ['juli', 22]] 
df2 = pd.DataFrame(lst)

In [208]:
df2

Unnamed: 0,0,1
0,tom,25
1,krish,30
2,nick,26
3,juli,22


In [209]:
df2.columns = ['Name','Age']

In [210]:
df2

Unnamed: 0,Name,Age
0,tom,25
1,krish,30
2,nick,26
3,juli,22


In [211]:
list_of_lists = [
    ['Emma', 29, 'HR'],
    ['Oliver', 25, 'Finance'],
    ['Harry', 33, 'Marketing'],
    ['Sophia', 24, 'IT']]
df3 = pd.DataFrame(list_of_lists, columns = ['Name', 'Age', 'Department'])
df3

Unnamed: 0,Name,Age,Department
0,Emma,29,HR
1,Oliver,25,Finance
2,Harry,33,Marketing
3,Sophia,24,IT
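The three construction styles shown so far should describe the same table when they carry the same data. The sketch below checks this on a small made-up dataset (the names and ages are illustrative only):

```python
import pandas as pd

# The same two records expressed three ways.
rows_as_dicts = [{'name': 'Emma', 'age': 29}, {'name': 'Oliver', 'age': 25}]
cols_as_lists = {'name': ['Emma', 'Oliver'], 'age': [29, 25]}
rows_as_lists = [['Emma', 29], ['Oliver', 25]]

from_dicts = pd.DataFrame(rows_as_dicts)   # each dict is one row
from_cols = pd.DataFrame(cols_as_lists)    # each list is one column
from_rows = pd.DataFrame(rows_as_lists, columns=['name', 'age'])  # each inner list is one row

# DataFrame.equals compares values, labels, and dtypes.
all_equal = from_dicts.equals(from_cols) and from_cols.equals(from_rows)
print(all_equal)
```

Which form to use mostly depends on how the data arrives: record-like data suits a list of dictionaries, while column-like data suits a dictionary of lists.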


In [7]:
from numpy.random import randn
np.random.seed(101)

In [21]:
df = pd.DataFrame(randn(5,4), index='A B C D E'.split(), columns='W X Y Z'.split())

In [22]:
df

Unnamed: 0,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame. Passing a list of column names in brackets returns a DataFrame with just those columns:

In [10]:
df[['W', 'X', 'Y']]

Unnamed: 0,W,X,Y
A,2.70685,0.628133,0.907969
B,0.651118,-0.319318,-0.848077
C,-2.018168,0.740122,0.528813
D,0.188695,-0.758872,-0.933237
E,0.190794,1.978757,2.605967


In [11]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [220]:
# Attribute (dot) access to a column (NOT RECOMMENDED! Prefer df['W'])
df.W

A    0.339572
B   -0.748595
C    1.894557
D    1.009741
E   -1.493926
Name: W, dtype: float64
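One likely reason dot access is discouraged is that it breaks as soon as a column name collides with an existing DataFrame attribute or is not a valid Python identifier. A minimal sketch, using a made-up DataFrame `df_demo` whose column names were picked to trigger both problems:

```python
import pandas as pd

# Hypothetical columns: 'count' clashes with the DataFrame.count method,
# and 'unit price' contains a space.
df_demo = pd.DataFrame({'W': [1, 2],
                        'count': [10, 20],
                        'unit price': [1.5, 2.5]})

w_attr = df_demo.W            # works: 'W' clashes with nothing
count_attr = df_demo.count    # NOT the column: this is the count() method
count_col = df_demo['count']  # bracket access always returns the column

print(callable(count_attr))       # the attribute is a method
print(type(count_col).__name__)   # the bracket result is a Series
print(df_demo['unit price'].tolist())
```

Bracket notation works for every column name, so it is the safer habit.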

DataFrame columns are just Series:

In [132]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [12]:
df['new'] = df['W'] + df['Y']

In [13]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [14]:
df['new_2'] = df['W'] + df['new']
df

Unnamed: 0,W,X,Y,Z,new,new_2
A,2.70685,0.628133,0.907969,0.503826,3.614819,6.321669
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,0.454159
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,-3.507523
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.555847
E,0.190794,1.978757,2.605967,0.683509,2.796762,2.987556


**Removing Columns**

In [15]:
df

Unnamed: 0,W,X,Y,Z,new,new_2
A,2.70685,0.628133,0.907969,0.503826,3.614819,6.321669
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,0.454159
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,-3.507523
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.555847
E,0.190794,1.978757,2.605967,0.683509,2.796762,2.987556


In [16]:
df.drop('new', axis = 1)

Unnamed: 0,W,X,Y,Z,new_2
A,2.70685,0.628133,0.907969,0.503826,6.321669
B,0.651118,-0.319318,-0.848077,0.605965,0.454159
C,-2.018168,0.740122,0.528813,-0.589001,-3.507523
D,0.188695,-0.758872,-0.933237,0.955057,-0.555847
E,0.190794,1.978757,2.605967,0.683509,2.987556


In [231]:
df.drop('new_2', axis = 1)

Unnamed: 0,W,X,Y,Z,new
A,0.339572,-0.037705,0.136608,1.645435,0.47618
B,-0.748595,-0.065104,1.446365,0.808996,0.69777
C,1.894557,-0.505114,-0.693828,0.460784,1.200729
D,1.009741,0.303477,-2.062517,-1.735204,-1.052776
E,-1.493926,1.378229,-1.598018,-0.873802,-3.091944


In [136]:
# drop() returns a new DataFrame; df itself is unchanged unless inplace=True
df

Unnamed: 0,W,X,Y,Z,new
A,0.221491,-0.855196,1.54199,0.666319,1.763481
B,-0.538235,-0.568581,1.407338,0.641806,0.869104
C,-0.9051,-0.391157,1.028293,-1.972605,0.123193
D,-0.866885,0.720788,-1.223082,1.60678,-2.089967
E,-1.11571,-1.385379,-1.32966,0.04146,-2.44537


In [17]:
df.drop('new', axis = 1, inplace = True)

In [18]:
df.drop('new_2', axis = 1, inplace = True)

In [19]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
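The difference between the returned copy and the original can be sketched on a small made-up DataFrame (`frame` and its values are illustrative). Reassigning the result is shown as an alternative to `inplace=True`:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]})

# Without inplace=True, drop() hands back a new DataFrame.
trimmed = frame.drop('new', axis=1)
print(list(frame.columns))    # the original still has 'new'
print(list(trimmed.columns))  # the copy does not

# Alternative to inplace=True: reassign the result to the same name.
frame = frame.drop(columns='new')   # columns='new' is equivalent to ('new', axis=1)
print(list(frame.columns))
```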


You can also drop rows this way:

In [238]:
df.drop('E', axis = 0)

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204


In [239]:
df

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204
E,-1.493926,1.378229,-1.598018,-0.873802


# Slicing

The general format to slice both rows and columns together looks like this:

```
data.loc['row name start': 'row name end', 'column name start': 'column name end']
```

In [240]:
df.loc['A']

W    0.339572
X   -0.037705
Y    0.136608
Z    1.645435
Name: A, dtype: float64

In [74]:
df.loc['A':'C', 'X':'Z']

Unnamed: 0,X,Y,Z
A,1.693723,-1.706086,-1.159119
B,0.390528,0.166905,0.184502
C,0.07296,0.638787,0.329646


In [241]:
df.loc['C': 'E', 'Y': 'Z']

Unnamed: 0,Y,Z
C,-0.693828,0.460784
D,-2.062517,-1.735204
E,-1.598018,-0.873802


If you want all the rows and certain columns:

In [247]:
df.loc[:,'X':'Z']

Unnamed: 0,X,Y,Z
A,-0.037705,0.136608,1.645435
B,-0.065104,1.446365,0.808996
C,-0.505114,-0.693828,0.460784
D,0.303477,-2.062517,-1.735204
E,1.378229,-1.598018,-0.873802


If you want certain rows and all the columns:

In [245]:
df.loc['B':'D', :]

Unnamed: 0,W,X,Y,Z
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204


### Summary so far

- `.loc[]` selects rows and columns by label. A label slice such as `'A':'C'` includes both endpoints.
- The row selection always comes first, followed by the column selection.
- If we are slicing rows but not columns, we only need to specify the row labels.


If you only want certain rows and columns, pass lists of labels:

In [248]:
df.loc[['A', 'D'], ['W', 'Y']]

Unnamed: 0,W,Y
A,0.339572,0.136608
D,1.009741,-2.062517


Or select by integer position instead of label, using `.iloc[]`:

In [250]:
df

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204
E,-1.493926,1.378229,-1.598018,-0.873802


In [249]:
df.iloc[2]

W    1.894557
X   -0.505114
Y   -0.693828
Z    0.460784
Name: C, dtype: float64

In [251]:
df.iloc[2:4, 0:3]

Unnamed: 0,W,X,Y
C,1.894557,-0.505114,-0.693828
D,1.009741,0.303477,-2.062517


In [252]:
df.iloc[1:3, 3:4]

Unnamed: 0,Z
B,0.808996
C,0.460784
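A detail worth checking when moving between the two indexers is how slice endpoints are treated. The sketch below uses a made-up 5x4 grid of integers (not the random data above) to compare a label slice with a position slice:

```python
import numpy as np
import pandas as pd

# Hypothetical grid with the same labels as the examples above.
grid = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=list('ABCDE'), columns=list('WXYZ'))

by_label = grid.loc['B':'D', 'X':'Y']   # label slices include both endpoints
by_position = grid.iloc[1:3, 1:2]       # position slices exclude the stop

print(list(by_label.index), list(by_label.columns))
print(list(by_position.index), list(by_position.columns))
```

Here `.loc` returns rows B through D and columns X through Y, while `.iloc[1:3, 1:2]` returns only rows B and C and column X, following ordinary Python slicing rules.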


### Conditional Selection or Filtering

An important feature of pandas is conditional selection using bracket notation.

Filtering is one of the most frequent data manipulations you will do in data analysis. It is typically used either to get rid of unwanted rows or to analyze only the rows whose values in a particular column meet a condition.

In [253]:
df

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204
E,-1.493926,1.378229,-1.598018,-0.873802


In [254]:
df > 0

Unnamed: 0,W,X,Y,Z
A,True,False,True,True
B,False,False,True,True
C,True,False,False,True
D,True,True,False,False
E,False,True,False,False


In [20]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [258]:
df[df['W'] > 0]

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204


In [259]:
df[df['W'] > 0]['Y']

A    0.136608
C   -0.693828
D   -2.062517
Name: Y, dtype: float64

In [260]:
df[df['W'] > 0][['Y','X']]

Unnamed: 0,Y,X
A,0.136608,-0.037705
C,-0.693828,-0.505114
D,-2.062517,0.303477
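Filtering rows and then picking columns can also be written as a single `.loc[]` call, which avoids chaining two bracket operations. A minimal sketch on made-up data (`scores` is a hypothetical DataFrame):

```python
import pandas as pd

scores = pd.DataFrame({'W': [1.0, -2.0, 3.0],
                       'X': [-1.0, 5.0, -6.0],
                       'Y': [7.0, 8.0, 9.0]},
                      index=['A', 'B', 'C'])

mask = scores['W'] > 0                   # boolean Series, one value per row

chained = scores[mask][['Y', 'X']]       # filter rows, then select columns
one_step = scores.loc[mask, ['Y', 'X']]  # rows and columns in one call

same_result = chained.equals(one_step)
print(same_result)
```

Both forms give the same result when reading data. The single `.loc[]` form is generally the safer choice if you later want to assign values to the selection.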


To combine two conditions, use `|` (or) and `&` (and), wrapping each condition in parentheses:

In [261]:
df[(df['W'] > 0) & (df['X'] < 0)]

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
C,1.894557,-0.505114,-0.693828,0.460784


In [262]:
df[(df['W'] > 2) | (df['Y'] < 0)]

Unnamed: 0,W,X,Y,Z
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204
E,-1.493926,1.378229,-1.598018,-0.873802
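The sketch below, on a small made-up DataFrame `t`, shows the element-wise operators side by side and what happens if Python's plain `and` keyword is used instead:

```python
import pandas as pd

t = pd.DataFrame({'W': [1, -1, 3], 'X': [-1, -2, 4]})

both = t[(t['W'] > 0) & (t['X'] < 0)]     # rows where both conditions hold
either = t[(t['W'] > 0) | (t['X'] < 0)]   # rows where at least one holds
neither = t[~((t['W'] > 0) | (t['X'] < 0))]  # ~ negates a condition

# The keyword `and` tries to reduce a whole Series to a single
# True/False value, which pandas refuses to do.
try:
    t[(t['W'] > 0) and (t['X'] < 0)]
    raised = False
except ValueError:
    raised = True

print(list(both.index), list(either.index), list(neither.index), raised)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` and `<`.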


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [263]:
df

Unnamed: 0,W,X,Y,Z
A,0.339572,-0.037705,0.136608,1.645435
B,-0.748595,-0.065104,1.446365,0.808996
C,1.894557,-0.505114,-0.693828,0.460784
D,1.009741,0.303477,-2.062517,-1.735204
E,-1.493926,1.378229,-1.598018,-0.873802


In [264]:
# Reset to default 0,1...n index
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,0.339572,-0.037705,0.136608,1.645435
1,B,-0.748595,-0.065104,1.446365,0.808996
2,C,1.894557,-0.505114,-0.693828,0.460784
3,D,1.009741,0.303477,-2.062517,-1.735204
4,E,-1.493926,1.378229,-1.598018,-0.873802


In [265]:
newind = 'CA NY WY OR CO'.split()

In [41]:
newind

['CA', 'NY', 'WY', 'OR', 'CO']

In [266]:
df['States'] = newind

In [267]:
df

Unnamed: 0,W,X,Y,Z,States
A,0.339572,-0.037705,0.136608,1.645435,CA
B,-0.748595,-0.065104,1.446365,0.808996,NY
C,1.894557,-0.505114,-0.693828,0.460784,WY
D,1.009741,0.303477,-2.062517,-1.735204,OR
E,-1.493926,1.378229,-1.598018,-0.873802,CO


In [268]:
df.set_index('States')

States,W,X,Y,Z
CA,0.339572,-0.037705,0.136608,1.645435
NY,-0.748595,-0.065104,1.446365,0.808996
WY,1.894557,-0.505114,-0.693828,0.460784
OR,1.009741,0.303477,-2.062517,-1.735204
CO,-1.493926,1.378229,-1.598018,-0.873802


In [269]:
df

Unnamed: 0,W,X,Y,Z,States
A,0.339572,-0.037705,0.136608,1.645435,CA
B,-0.748595,-0.065104,1.446365,0.808996,NY
C,1.894557,-0.505114,-0.693828,0.460784,WY
D,1.009741,0.303477,-2.062517,-1.735204,OR
E,-1.493926,1.378229,-1.598018,-0.873802,CO


In [270]:
df.set_index('States', inplace=True)

In [271]:
df

States,W,X,Y,Z
CA,0.339572,-0.037705,0.136608,1.645435
NY,-0.748595,-0.065104,1.446365,0.808996
WY,1.894557,-0.505114,-0.693828,0.460784
OR,1.009741,0.303477,-2.062517,-1.735204
CO,-1.493926,1.378229,-1.598018,-0.873802
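As a compact recap of `set_index` and `reset_index`, here is a sketch on a made-up three-row DataFrame (`df_states` and its values are illustrative):

```python
import pandas as pd

df_states = pd.DataFrame({'W': [1, 2, 3],
                          'States': ['CA', 'NY', 'WY']},
                         index=['A', 'B', 'C'])

# set_index promotes a column to the index; the old A/B/C labels are discarded.
by_state = df_states.set_index('States')
ny_value = by_state.loc['NY', 'W']      # rows are now addressed by state
print(ny_value)

# reset_index moves the index back into an ordinary column
# and restores the default 0, 1, ... n index.
restored = by_state.reset_index()
print(list(restored.columns), list(restored.index))
```

Note that the original `A`, `B`, `C` labels are not recovered by `reset_index`, so keep them in a column first if they are still needed.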


# Read .csv file

In [24]:
salary = pd.read_csv("Salaries.csv")

In [25]:
# look at the head of dataset
salary.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [26]:
salary.head(2) # show only the first 2 rows of the dataframe

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,


In [27]:
# look at the tail of the dataset
salary.tail()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
148649,148650,Roy I Tillery,Custodian,0.0,0.0,0.0,0.0,0.0,0.0,2014,,San Francisco,
148650,148651,Not provided,Not provided,,,,,0.0,0.0,2014,,San Francisco,
148651,148652,Not provided,Not provided,,,,,0.0,0.0,2014,,San Francisco,
148652,148653,Not provided,Not provided,,,,,0.0,0.0,2014,,San Francisco,
148653,148654,Joe Lopez,"Counselor, Log Cabin Ranch",0.0,0.0,-618.13,0.0,-618.13,-618.13,2014,,San Francisco,


In [28]:
salary.tail(2)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
148652,148653,Not provided,Not provided,,,,,0.0,0.0,2014,,San Francisco,
148653,148654,Joe Lopez,"Counselor, Log Cabin Ranch",0.0,0.0,-618.13,0.0,-618.13,-618.13,2014,,San Francisco,
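To try `read_csv`, `head`, and `tail` without the Salaries.csv file, the sketch below reads CSV text from memory instead. The three records are made up, and `io.StringIO` simply stands in for a file on disk:

```python
import io
import pandas as pd

# Made-up CSV text; the second record has an empty BasePay field.
csv_text = """Id,EmployeeName,BasePay
1,Ann Example,1000.5
2,Bob Example,
3,Cy Example,750.0
"""

pay = pd.read_csv(io.StringIO(csv_text))  # accepts a path or a file-like object

first_two = pay.head(2)   # first 2 rows
last_one = pay.tail(1)    # last row

print(pay.shape)
print(pay['BasePay'].isnull().sum())  # empty fields are read as NaN
```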


In [29]:
salary['BasePay']

0         167411.18
1         155966.02
2         212739.13
3          77916.00
4         134401.60
            ...    
148649         0.00
148650          NaN
148651          NaN
148652          NaN
148653         0.00
Name: BasePay, Length: 148654, dtype: float64

In [30]:
salary['JobTitle']

0         GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
1                        CAPTAIN III (POLICE DEPARTMENT)
2                        CAPTAIN III (POLICE DEPARTMENT)
3                   WIRE ROPE CABLE MAINTENANCE MECHANIC
4           DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)
                               ...                      
148649                                         Custodian
148650                                      Not provided
148651                                      Not provided
148652                                      Not provided
148653                        Counselor, Log Cabin Ranch
Name: JobTitle, Length: 148654, dtype: object

In [31]:
salary[['JobTitle', 'BasePay', 'OvertimePay', 'OtherPay']]

Unnamed: 0,JobTitle,BasePay,OvertimePay,OtherPay
0,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25
1,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38
2,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60
3,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90
4,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59
...,...,...,...,...
148649,Custodian,0.00,0.00,0.00
148650,Not provided,,,
148651,Not provided,,,
148652,Not provided,,,


In [32]:
# pass a list of column names
salary[['JobTitle', 'BasePay', 'OvertimePay']]

Unnamed: 0,JobTitle,BasePay,OvertimePay
0,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00
1,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88
2,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18
3,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71
4,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00
...,...,...,...
148649,Custodian,0.00,0.00
148650,Not provided,,
148651,Not provided,,
148652,Not provided,,


In [33]:
salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB


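`info()` prints a concise summary of a DataFrame, including the index dtype, the columns and their dtypes, the non-null counts, and memory usage.

`describe()` reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column: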

In [35]:
df.describe()

Unnamed: 0,W,X,Y,Z
count,5.0,5.0,5.0,5.0
mean,0.072331,0.660979,-0.321135,0.367287
std,0.499191,1.121089,0.971818,1.121759
min,-0.497104,-0.75407,-1.706086,-1.159119
25%,-0.134841,0.07296,-0.943406,0.184502
50%,-0.116773,0.390528,0.166905,0.329646
75%,0.302665,1.693723,0.238127,0.484752
max,0.807706,1.901755,0.638787,1.996652
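Two behaviours of `describe()` are easy to miss: by default it summarizes only the numeric columns, and missing values are left out of the statistics. A small sketch on made-up data (`pay_demo` is hypothetical):

```python
import pandas as pd

pay_demo = pd.DataFrame({'JobTitle': ['Clerk', 'Cook', 'Pilot', 'Guard'],
                         'BasePay': [100.0, 200.0, 300.0, None]})

summary = pay_demo.describe()

print(list(summary.columns))             # the text column is not summarized
print(summary.loc['count', 'BasePay'])   # the missing value is not counted
print(summary.loc['mean', 'BasePay'])    # mean of 100, 200, 300
```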


### Operations

In [303]:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [444, 555, 666, 444], 'col3': ['abc', 'def', 'ghi', 'xyz']})

In [304]:
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [305]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [306]:
df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

### Info on Unique Values

In [308]:
df['col2'].unique()

array([444, 555, 666], dtype=int64)

In [309]:
df['col2'].nunique()

3

In [310]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64
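The three methods above are closely related, which the following sketch checks on the same values as `col2` (rebuilt here as a standalone Series so the block runs on its own):

```python
import pandas as pd

col = pd.Series([444, 555, 666, 444], name='col2')

distinct = col.unique()       # array of distinct values
n_distinct = col.nunique()    # how many distinct values there are
counts = col.value_counts()   # occurrences of each distinct value

print(n_distinct == len(distinct))   # nunique is the length of unique
print(counts.sum() == len(col))      # the counts add up to the number of rows
print(counts.loc[444])               # 444 appears twice
```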

#### Sorting and Ordering a DataFrame

In [311]:
df

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [315]:
df.sort_values(by = 'col2', ascending = False, inplace = True)

In [316]:
df

Unnamed: 0,col1,col2,col3
2,3,666,ghi
1,2,555,def
0,1,444,abc
3,4,444,xyz
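Sorting keeps each row's original index label, as the output above shows. The sketch below rebuilds the same data under a different name (`items`), sorts by two columns so ties in `col2` are broken by `col1`, and then renumbers the rows:

```python
import pandas as pd

items = pd.DataFrame({'col1': [1, 2, 3, 4],
                      'col2': [444, 555, 666, 444],
                      'col3': ['abc', 'def', 'ghi', 'xyz']})

# col2 descending; rows with equal col2 are ordered by col1 ascending.
ordered = items.sort_values(by=['col2', 'col1'], ascending=[False, True])
print(list(ordered.index))     # original labels travel with their rows

# drop=True discards the old labels instead of keeping them as a column.
renumbered = ordered.reset_index(drop=True)
print(list(renumbered.index))
```

Like `drop`, `sort_values` returns a new DataFrame unless `inplace=True` is passed, so `items` itself stays in its original order here.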


In [317]:
df = pd.DataFrame({'col1': [np.nan, 2, 3, 4], 'col2': [np.nan, 555, 666, 444], 'col3': ['abc', 'def', 'ghi', 'xyz']})

In [318]:
df

Unnamed: 0,col1,col2,col3
0,,,abc
1,2.0,555.0,def
2,3.0,666.0,ghi
3,4.0,444.0,xyz


### Find Null Values or Check for Null Values

In [319]:
df.isnull()

Unnamed: 0,col1,col2,col3
0,True,True,False
1,False,False,False
2,False,False,False
3,False,False,False


In [320]:
df.isnull().sum()

col1    1
col2    1
col3    0
dtype: int64

In [321]:
# Drop rows with Null Values

In [322]:
df.dropna()

Unnamed: 0,col1,col2,col3
1,2.0,555.0,def
2,3.0,666.0,ghi
3,4.0,444.0,xyz


In [326]:
# Replace null values with a placeholder (returns a new DataFrame; df is unchanged)
df.fillna("missing")

Unnamed: 0,col1,col2,col3
0,missing,missing,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [327]:
df

Unnamed: 0,col1,col2,col3
0,,,abc
1,2.0,555.0,def
2,3.0,666.0,ghi
3,4.0,444.0,xyz
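Filling every gap with one placeholder is not the only option. As a sketch of two alternatives, the block below rebuilds the same data under a different name (`gaps`), fills each column with its own value, and drops rows based on a single column. The choice of the column mean and of 0 as fill values is only an illustration:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'col1': [np.nan, 2, 3, 4],
                     'col2': [np.nan, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

# A dict maps each column name to its own fill value.
filled = gaps.fillna({'col1': gaps['col1'].mean(), 'col2': 0})
print(filled.loc[0, 'col1'])   # mean of 2, 3, 4
print(filled.loc[0, 'col2'])

# subset limits the null check to the listed columns.
complete = gaps.dropna(subset=['col2'])
print(len(complete))
```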
