# Pandas

In [124]:
import pandas as pd
import numpy as np
import math

## Creating a DataFrame

There are two ways of selecting a column of a DataFrame as a Series


In [None]:
#1 With single square brackets

df['order_status']

#2 As an attribute

df.order_status


## Pandas Methods

### pd.Series.unique() Method

This method can be used with pandas **Series**, not DataFrames, and returns an array (not a Python list) of the unique values in the series, in order of first appearance.

In [None]:
df['order_status'].unique()

In [None]:
df.order_status.unique()
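
As a quick check of what unique() hands back, here is a minimal sketch using a small made-up order_status column (the values are invented for illustration and are not from the original data). The result is typically a NumPy array, so wrap it in list() when a plain Python list is needed.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'order_status': ['delivered', 'shipped', 'delivered', 'canceled']})

statuses = df['order_status'].unique()

# Show the container type, then the values as a plain list
print(type(statuses))
status_list = list(statuses)
print(status_list)
```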

### pd.to_datetime() Method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

Note that in the example below, the column is passed as a Series (single brackets, df["Date"]), not as a one-column DataFrame (double brackets, df[["Date"]]).

If a DataFrame is provided, the method expects minimally the following columns: "year", "month", "day".
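
To illustrate the DataFrame case, here is a minimal sketch that assembles dates from separate "year", "month" and "day" columns. The frame name parts and its values are invented for this example.

```python
import pandas as pd

# Hypothetical date parts, for illustration only
parts = pd.DataFrame({'year': [2022, 2023],
                      'month': [8, 1],
                      'day': [23, 5]})

# Passing the whole DataFrame combines the columns into one datetime Series
dates = pd.to_datetime(parts)

formatted = dates.dt.strftime('%Y-%m-%d').tolist()
print(formatted)
```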

In [97]:
# Create DataFrame Of String Format Dates

df = pd.DataFrame({'Date':["2022-08-23",
                           "2022-08-23",
                           "2022-08-23",
                           "2022-08-23"]
                  }
                 )

display(df)
print('\n')
display(df.info())

Unnamed: 0,Date
0,2022-08-23
1,2022-08-23
2,2022-08-23
3,2022-08-23




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes


None

#### Converting A Series To DateTime & Back

In [95]:
# Convert 'Date' Column with 'to_datetime()' function

df["Date"]= pd.to_datetime(df["Date"])

display(df.info())

# Convert back to string format with specified date format

df['Date'].dt.strftime('%Y-%m-%d')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    9 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 200.0 bytes


None

0    2022-08-23
1    2022-08-23
2    2022-08-23
3    2022-08-23
4    2022-08-23
5    2022-08-23
6    2022-08-23
7    2022-08-23
8    2022-08-23
Name: Date, dtype: object

#### Converting A Single Value To DateTime & Back

You can convert a single string to datetime and convert a datetime 'timestamp' back as below. You need this approach when dealing with a single datapoint, because the .dt accessor used above is only available on a Series.

In [77]:
str_date = "2022-08-23"
print(str_date)

date = pd.to_datetime(str_date)

print(date)
print(type(date))

str_date = str(date).split()
print(str_date)

print(str_date[0])

2022-08-23
2022-08-23 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
['2022-08-23', '00:00:00']
2022-08-23
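
A Timestamp also has its own strftime() method, which avoids splitting the string by hand. A minimal sketch, reusing the same example date:

```python
import pandas as pd

date = pd.to_datetime("2022-08-23")

# Format the Timestamp directly instead of using str(date).split()[0]
str_date = date.strftime('%Y-%m-%d')
print(str_date)
```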


#### pd.DateOffset()

In [79]:
print(date)

print(date + pd.DateOffset(years = 1,
                          months = 1,
                          weeks = 1, 
                          days = 1, 
                          hours = 1, 
                          minutes = 1, 
                          seconds = 1)
     )


2022-08-23 00:00:00
2023-10-01 01:01:01
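
Offsets can also be subtracted, and month arithmetic is calendar-aware: when the target month is shorter, the day is clipped to the last valid day. A small sketch with an example date chosen for illustration:

```python
import pandas as pd

start = pd.Timestamp('2022-01-31')

# February has no 31st, so the result is clipped to the end of the month
plus_month = start + pd.DateOffset(months=1)

# Subtraction works the same way as addition
minus_year = start - pd.DateOffset(years=1)

print(plus_month)
print(minus_year)
```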


### pd.DataFrame.groupby() Method

groupby() needs an aggregation (for example count(), sum() or nunique()) to apply to the columns that are not being 'grouped by'.

In [None]:
df[['order_id','product_id']].groupby('order_id').count()

In [None]:
df[['order_id','product_id']].groupby('order_id').sum()

In [None]:
order_items[['order_id','seller_id']].groupby('order_id').nunique()
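
The cells above rely on order data that is not shown here. The following self-contained sketch uses a small invented order_items table (all values are hypothetical) to show what two of these aggregations return.

```python
import pandas as pd

# Hypothetical order items, for illustration only
order_items = pd.DataFrame({
    'order_id':   ['A', 'A', 'B', 'B', 'B'],
    'product_id': ['p1', 'p2', 'p1', 'p1', 'p3'],
    'seller_id':  ['s1', 's1', 's2', 's3', 's3'],
})

grouped = order_items.groupby('order_id')

counts = grouped['product_id'].count()      # number of rows per order
n_sellers = grouped['seller_id'].nunique()  # number of distinct sellers per order

print(counts)
print(n_sellers)
```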

### pd.Series.agg() Method

**CAN BE USED IN GROUPBY() ON A DATAFRAME**

Allows you to perform different operations on different columns of a grouped DataFrame in a single call.

In [None]:
df[['order_id','price','freight_value']].groupby('order_id').agg({'price': ['sum', 'mean'], 'freight_value': 'mean' })

If you need to join multiple string columns, you can use agg:

In [None]:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
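
Both uses of agg() can be tried on invented data. In this sketch the column names mirror the cells above, but the values and the periods frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical order lines, for illustration only
df = pd.DataFrame({
    'order_id': ['A', 'A', 'B'],
    'price': [10.0, 20.0, 5.0],
    'freight_value': [1.0, 3.0, 2.0],
})

# Different aggregations per column in one call
summary = df.groupby('order_id').agg({'price': ['sum', 'mean'],
                                      'freight_value': 'mean'})
print(summary)

# Joining string columns row by row (every joined column must hold strings)
periods = pd.DataFrame({'Year': ['2021', '2022'], 'quarter': ['Q1', 'Q3']})
periods['period'] = periods[['Year', 'quarter']].agg('-'.join, axis=1)
print(periods)
```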

### pd.DataFrame.merge() Method

This can be written in two different ways, allowing for either chaining (method form) or nesting (function form) of multiple merges.

In [None]:
A_df.merge(B_df, how='left', on='order_id')

In [None]:
pd.merge(A_df, B_df, how='left', on='order_id')
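
A sketch of chaining versus nesting with three small invented tables (C_df and all values are hypothetical). Both forms should give the same result.

```python
import pandas as pd

# Hypothetical tables, for illustration only
A_df = pd.DataFrame({'order_id': [1, 2, 3],
                     'status': ['delivered', 'shipped', 'delivered']})
B_df = pd.DataFrame({'order_id': [1, 2], 'price': [10.0, 20.0]})
C_df = pd.DataFrame({'order_id': [1, 3], 'seller': ['s1', 's2']})

# Chaining with the method form
chained = (A_df.merge(B_df, how='left', on='order_id')
               .merge(C_df, how='left', on='order_id'))

# Nesting with the function form
nested = pd.merge(pd.merge(A_df, B_df, how='left', on='order_id'),
                  C_df, how='left', on='order_id')

print(chained.equals(nested))
```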

In [None]:
df.query("order_status == 'delivered'")

### pd.Series.map() Method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html

Below is how you would use the map() method to change certain values in a series to another, using a dictionary.

In [8]:
df = pd.DataFrame({'Stuff':['cat', 'dog', np.nan, 'rabbit']})

df.Stuff

0       cat
1       dog
2       NaN
3    rabbit
Name: Stuff, dtype: object

-- You can substitute stuff with a dictionary. **Note that anything not specified is converted to a NaN**

In [10]:
df.Stuff.map({
    'cat': 'kitten', 
    'dog': 'puppy'
})

0    kitten
1     puppy
2       NaN
3       NaN
Name: Stuff, dtype: object
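
If the unmapped values should be kept rather than turned into NaN, two common options are to fill the gaps from the original series or to use replace() instead of map(). A minimal sketch, where the names stuff and mapping are introduced only for this example:

```python
import numpy as np
import pandas as pd

stuff = pd.Series(['cat', 'dog', np.nan, 'rabbit'], name='Stuff')
mapping = {'cat': 'kitten', 'dog': 'puppy'}

# Option 1: map, then fall back to the original value where the map gave NaN
kept = stuff.map(mapping).fillna(stuff)

# Option 2: replace only touches the values listed in the dictionary
replaced = stuff.replace(mapping)

print(kept)
print(replaced)
```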

-- It also accepts a function

In [12]:
df.Stuff.map('I am a {}'.format)

0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
Name: Stuff, dtype: object

-- To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can be used:

In [13]:
df.Stuff.map('I am a {}'.format, na_action='ignore')

0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
Name: Stuff, dtype: object

### pd.Series.apply() Method


First you define a function. 

Then you can use the pd.Series.apply() method to apply the function to a column of a DataFrame. 

In [None]:
# Define the function (string.punctuation comes from the standard library)

import string

def txt_rp(x):

    for punc in string.punctuation:
        x = x.replace(punc, '')    

    return x

# Use the apply method to apply it to a column of a DataFrame

df['clean_text'] = df['text'].apply(txt_rp)

### Using The Lambda Function (pd.DataFrame.apply() and pd.DataFrame.assign() Methods)


**Example 1**: Applying lambda function to single column using Dataframe.assign()

In the below example, the lambda function is applied to the ‘Total_Marks’ column and a new column ‘Percentage’ is formed with the help of it.

In [2]:
import pandas as pd
  
# creating and initializing a list
values= [['Rohan',455],['Elvish',250],['Deepak',495],
         ['Soni',400],['Radhika',350],['Vansh',450]]
 
# creating a pandas dataframe
df = pd.DataFrame(values,columns=['Name','Total_Marks'])
 
# Applying lambda function to find
# percentage of 'Total_Marks' column
# using df.assign()
df = df.assign(Percentage = lambda x: (x['Total_Marks'] /500 * 100))
 
# displaying the data frame
df

Unnamed: 0,Name,Total_Marks,Percentage
0,Rohan,455,91.0
1,Elvish,250,50.0
2,Deepak,495,99.0
3,Soni,400,80.0
4,Radhika,350,70.0
5,Vansh,450,90.0


**Example 2**: Applying lambda function to multiple columns using Dataframe.assign()

In the below example, lambda function is applied to 3 columns i.e ‘Field_1’, ‘Field_2’, and ‘Field_3’.

In [1]:
# importing pandas library
import pandas as pd
 
# creating and initializing a nested list
values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
               [45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
               [51, 2.3, 111]]
 
# creating a pandas dataframe
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'])
 
# Applying lambda function to find
# the product of 3 columns using
# df.assign()
df = df.assign(Product=lambda x: (x['Field_1'] * x['Field_2'] * x['Field_3']))
 
# printing dataframe
df

Unnamed: 0,Field_1,Field_2,Field_3,Product
0,15,2.5,100,3750.0
1,20,4.5,50,4500.0
2,25,5.2,80,10400.0
3,45,5.8,48,12528.0
4,40,6.3,70,17640.0
5,41,6.4,90,23616.0
6,51,2.3,111,13020.3


**Example 3**: Applying lambda function to single row using Dataframe.apply()

In the below example, a lambda function is applied to the row labelled ‘d’ and squares all the values in that row.

In [107]:
import pandas as pd
import numpy as np
 
# creating and initializing a nested list
values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
               [45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
               [51, 2.3, 111]]
 
# creating a pandas dataframe
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
 
 
# Apply function numpy.square() to square
# the values of one row only i.e. row
# with index name 'd'
df = df.apply(lambda x: np.square(x) if x.name == 'd' else x, axis=1)

# Note: with axis=1 each x is a whole row (a Series), so a condition such as
# x >= 10 raises "ValueError: The truth value of a Series is ambiguous."
# Test a scalar such as x.name instead.
 
# printing dataframe
df

**Example 4**: Applying lambda function to multiple rows using Dataframe.apply()

In the below example, a lambda function is applied to 3 rows starting with ‘a’, ‘e’, and ‘g’.

In [4]:
# importing pandas and numpy libraries
import pandas as pd
import numpy as np
 
# creating and initializing a nested list
values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
               [45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
               [51, 2.3, 111]]
 
# creating a pandas dataframe
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
 
 
# Apply function numpy.square() to square
# the values of 3 rows only i.e. with row
# index name 'a', 'e' and 'g' only
df = df.apply(lambda x: np.square(x) if x.name in [
              'a', 'e', 'g'] else x, axis=1)
 
# printing dataframe
df

Unnamed: 0,Field_1,Field_2,Field_3
a,225.0,6.25,10000.0
b,20.0,4.5,50.0
c,25.0,5.2,80.0
d,45.0,5.8,48.0
e,1600.0,39.69,4900.0
f,41.0,6.4,90.0
g,2601.0,5.29,12321.0


**Example 5**: Applying the lambda function simultaneously to multiple columns and rows

In this example, a lambda function is applied to two rows and three columns. 

In [105]:
# importing pandas and numpy libraries
import pandas as pd
import numpy as np
 
# creating and initializing a nested list
values_list = [[1.5, 2.5, 10.0], [2.0, 4.5, 5.0], [2.5, 5.2, 8.0],
               [4.5, 5.8, 4.8], [4.0, 6.3, 70], [4.1, 6.4, 9.0],
               [5.1, 2.3, 11.1]]
 
# creating a pandas dataframe
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
 
# Apply function numpy.square() to square
# the values of 2 rows only i.e. with row
# index name 'b' and 'f' only
df = df.apply(lambda x: np.square(x) if x.name in ['b', 'f'] else x, axis=1)
 
# Applying lambda function to find product of 3 columns
# i.e 'Field_1', 'Field_2' and 'Field_3'
df = df.assign(Product=lambda x: (x['Field_1'] * x['Field_2'] * x['Field_3']))
 
# printing dataframe
df

Unnamed: 0,Field_1,Field_2,Field_3,Product
a,1.5,2.5,10.0,37.5
b,4.0,20.25,25.0,2025.0
c,2.5,5.2,8.0,104.0
d,4.5,5.8,4.8,125.28
e,4.0,6.3,70.0,1764.0
f,16.81,40.96,81.0,55771.5456
g,5.1,2.3,11.1,130.203


**Example 6**: Creating a new column using a Lambda function applied to an existing one

In [100]:
import pandas as pd
 
# dataframe
df = pd.DataFrame({'Name': ['John', 'Jack', 'Shri',
                            'Krishna', 'Smith', 'Tessa'],
                   'Maths': [5, 3, 9, 10, 6, 3]})
 
# Adding the result column
df['Result'] = df['Maths'].apply(lambda x: 'Pass' if x>=5 else 'Fail')
 
print(df)

      Name  Maths Result
0     John      5   Pass
1     Jack      3   Fail
2     Shri      9   Pass
3  Krishna     10   Pass
4    Smith      6   Pass
5    Tessa      3   Fail


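Adding multiple if conditions. **The below will not work**, because elif is a statement and cannot appear inside a lambda expression. Either define the function separately and then use apply to apply it (see the pd.Series.apply() section before this one), or chain conditional expressions by nesting a second if/else after the else.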

In [102]:
df['Maths_spl Class'] = df["maths"].apply(lambda x: "No Need" if x>=5 elif x==5 "Hold" else "Need")

SyntaxError: invalid syntax (3022965822.py, line 1)
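
Two working alternatives are sketched here. The thresholds are an assumption: the first test is written as x > 5 so that the x == 5 branch can actually be reached, and the helper name spl_class is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jack', 'Shri',
                            'Krishna', 'Smith', 'Tessa'],
                   'Maths': [5, 3, 9, 10, 6, 3]})

# Option 1: chained conditional expressions inside the lambda
df['Maths_spl Class'] = df['Maths'].apply(
    lambda x: 'No Need' if x > 5 else ('Hold' if x == 5 else 'Need'))

# Option 2: a named function, where elif is allowed
def spl_class(x):
    if x > 5:
        return 'No Need'
    elif x == 5:
        return 'Hold'
    else:
        return 'Need'

df['Maths_spl Class 2'] = df['Maths'].apply(spl_class)

print(df)
```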

### pd.DataFrame.iterrows()

In [19]:
data = {
  "firstname": ["Sally", "Mary", "John"],
  "age": [50, 40, 30]
}

df = pd.DataFrame(data)

for index, row in df.iterrows():
    print(index)
    print(row["firstname"]) 

0
Sally
1
Mary
2
John


### You Can Also Loop Through A DataFrame With The zip() Function

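This is usually much more computationally efficient than iterrows().

zip() stops at the shortest input, so the sequences should be the same length (columns of the same DataFrame always are).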

In [23]:
print(df.firstname)
print(df.age)

0    Sally
1     Mary
2     John
Name: firstname, dtype: object
0    50
1    40
2    30
Name: age, dtype: int64


In [24]:
zip(df.firstname, df.age)

<zip at 0x123249780>

In [26]:
list(zip(df.firstname, df.age))

[('Sally', 50), ('Mary', 40), ('John', 30)]
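
The zipped columns can be unpacked directly in a for loop, which gives the same row-by-row access as iterrows(). A minimal sketch using the same data (the list name lines is introduced for this example):

```python
import pandas as pd

df = pd.DataFrame({"firstname": ["Sally", "Mary", "John"],
                   "age": [50, 40, 30]})

lines = []

# Each iteration yields one (firstname, age) pair
for firstname, age in zip(df.firstname, df.age):
    lines.append(f"{firstname} is {age}")

print(lines)
```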

## How To Do Stuff

### How to add another row to a DataFrame

Below are two ways of adding new rows to a DataFrame

In [None]:
#add row to end of DataFrame

df.loc[len(df.index)] = [value1, value2, value3, ...]

In [None]:
#append rows of df2 to end of existing DataFrame
#(DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job)

df = pd.concat([df, df2], ignore_index = True)
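
A runnable sketch of both approaches on a tiny invented table (names and values are hypothetical). It uses pd.concat() for the second approach, since DataFrame.append() was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'Name': ['jack', 'Riti'], 'Age': [34, 30]})
df2 = pd.DataFrame({'Name': ['Vansh'], 'Age': [31]})

# 1: add one row at the next integer label (assumes a default 0..n-1 index)
df.loc[len(df.index)] = ['Nanyu', 32]

# 2: add all rows of df2 and renumber the index
df = pd.concat([df, df2], ignore_index=True)

print(df)
```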

### How To Set The Index In A DataFrame

#### Single Index

In [5]:
# creating and initializing a nested list

students = [['jack', 34, 'Sydeny', 'Australia',85.96,400],
            ['Riti', 30, 'Delhi', 'India',95.20,750],
            ['Vansh', 31, 'Delhi', 'India',85.25,101],
            ['Nanyu', 32, 'Tokyo', 'Japan',74.21,900],
            ['Maychan', 16, 'New York', 'US',99.63,420],
            ['Mike', 17, 'las vegas', 'US',47.28,555]]
 
# Create a DataFrame object

df = pd.DataFrame(students,
                      columns=['Name', 'Age', 'City', 'Country','Agg_Marks','ID'],
                           index=['a', 'b', 'c', 'd', 'e', 'f'])

df.set_index('Agg_Marks', inplace = True)

df

Unnamed: 0_level_0,Name,Age,City,Country,ID
Agg_Marks,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
85.96,jack,34,Sydeny,Australia,400
95.2,Riti,30,Delhi,India,750
85.25,Vansh,31,Delhi,India,101
74.21,Nanyu,32,Tokyo,Japan,900
99.63,Maychan,16,New York,US,420
47.28,Mike,17,las vegas,US,555


#### Multi-Index

In [6]:
# creating and initializing a nested list

students = [['jack', 34, 'Sydeny', 'Australia',85.96,400],
            ['Riti', 30, 'Delhi', 'India',95.20,750],
            ['Vansh', 31, 'Delhi', 'India',85.25,101],
            ['Nanyu', 32, 'Tokyo', 'Japan',74.21,900],
            ['Maychan', 16, 'New York', 'US',99.63,420],
            ['Mike', 17, 'las vegas', 'US',47.28,555]]
 
# Create a DataFrame object

df = pd.DataFrame(students,
                      columns=['Name', 'Age', 'City', 'Country','Agg_Marks','ID'],
                           index=['a', 'b', 'c', 'd', 'e', 'f'])

df.set_index(['Name','City','ID'], inplace = True)

df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Age,Country,Agg_Marks
Name,City,ID,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
jack,Sydeny,400,34,Australia,85.96
Riti,Delhi,750,30,India,95.2
Vansh,Delhi,101,31,India,85.25
Nanyu,Tokyo,900,32,Japan,74.21
Maychan,New York,420,16,US,99.63
Mike,las vegas,555,17,US,47.28


In [13]:
# Adding Another Jack

df.loc['jack', 'New York', 666] = [99, 'US', 100.00]

In [17]:
# Searching The Index For 'Jack's

df.loc['jack']

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Country,Agg_Marks
City,ID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sydeny,400,34,Australia,85.96
New York,666,99,US,100.0


In [18]:
# Searching The Index For 'jack's from 'New York'

df.loc['jack', 'New York']


Unnamed: 0_level_0,Age,Country,Agg_Marks
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
666,99,US,100.0


### How To Delete Rows

#### Drop Rows With Specific Index Labels Using pd.DataFrame.drop()

##### Single Non-numeric Index Label

In [28]:
students = [['jack', 34, 'Sydeny', 'Australia',85.96,400],
            ['Riti', 30, 'Delhi', 'India',95.20,750],
            ['Vansh', 31, 'Delhi', 'India',85.25,101],
            ['Nanyu', 32, 'Tokyo', 'Japan',74.21,900],
            ['Maychan', 16, 'New York', 'US',99.63,420],
            ['Mike', 17, 'las vegas', 'US',47.28,555]]
 
# Create a DataFrame object

df = pd.DataFrame(students,
                  columns=['Name', 'Age', 'City', 'Country','Agg_Marks','ID'],
                  index=['a', 'b', 'c', 'd', 'e', 'f']
                 )

df.drop('a', inplace = True)
df

Unnamed: 0,Name,Age,City,Country,Agg_Marks,ID
b,Riti,30,Delhi,India,95.2,750
c,Vansh,31,Delhi,India,85.25,101
d,Nanyu,32,Tokyo,Japan,74.21,900
e,Maychan,16,New York,US,99.63,420
f,Mike,17,las vegas,US,47.28,555


##### Multiple Non-numeric Index Labels

In [29]:
students = [['jack', 34, 'Sydeny', 'Australia',85.96,400],
            ['Riti', 30, 'Delhi', 'India',95.20,750],
            ['Vansh', 31, 'Delhi', 'India',85.25,101],
            ['Nanyu', 32, 'Tokyo', 'Japan',74.21,900],
            ['Maychan', 16, 'New York', 'US',99.63,420],
            ['Mike', 17, 'las vegas', 'US',47.28,555]]
 
# Create a DataFrame object

df = pd.DataFrame(students,
                  columns=['Name', 'Age', 'City', 'Country','Agg_Marks','ID'],
                  index=['a', 'b', 'c', 'd', 'e', 'f']
                 )

df.drop(['a','b','f'], inplace = True)
df

Unnamed: 0,Name,Age,City,Country,Agg_Marks,ID
c,Vansh,31,Delhi,India,85.25,101
d,Nanyu,32,Tokyo,Japan,74.21,900
e,Maychan,16,New York,US,99.63,420


##### Single Numeric Index Label

In [31]:
students = [['jack', 34, 'Sydeny', 'Australia',85.96,400],
            ['Riti', 30, 'Delhi', 'India',95.20,750],
            ['Vansh', 31, 'Delhi', 'India',85.25,101],
            ['Nanyu', 32, 'Tokyo', 'Japan',74.21,900],
            ['Maychan', 16, 'New York', 'US',99.63,420],
            ['Mike', 17, 'las vegas', 'US',47.28,555]]
 
# Create a DataFrame object

df = pd.DataFrame(students,
                  columns=['Name', 'Age', 'City', 'Country','Agg_Marks','ID']
                  )

df.drop(1, inplace = True)
df

Unnamed: 0,Name,Age,City,Country,Agg_Marks,ID
0,jack,34,Sydeny,Australia,85.96,400
2,Vansh,31,Delhi,India,85.25,101
3,Nanyu,32,Tokyo,Japan,74.21,900
4,Maychan,16,New York,US,99.63,420
5,Mike,17,las vegas,US,47.28,555


#### Drop Rows that Contain a Specific Value

##### Drop the specific value by using Operators

In [32]:
data = pd.DataFrame({
  
    "name": ['sravan', 'jyothika', 'harsha', 'ramya',
             'sravan', 'jyothika', 'harsha', 'ramya',
             'sravan', 'jyothika', 'harsha', 'ramya'],
    "subjects": ['java', 'java', 'java', 'python',
                 'python', 'python', 'html/php', 'html/php',
                 'html/php', 'php/js', 'php/js', 'php/js'],
    "marks": [98, 79, 89, 97, 82, 98, 90,
              87, 78, 89, 93, 94]
})
  
# display
print(data)
  
print("---------------")
  
# drop rows where value is 98
# by using not equal operator
print(data[data.marks != 98])
  
print("---------------")

        name  subjects  marks
0     sravan      java     98
1   jyothika      java     79
2     harsha      java     89
3      ramya    python     97
4     sravan    python     82
5   jyothika    python     98
6     harsha  html/php     90
7      ramya  html/php     87
8     sravan  html/php     78
9   jyothika    php/js     89
10    harsha    php/js     93
11     ramya    php/js     94
---------------
        name  subjects  marks
1   jyothika      java     79
2     harsha      java     89
3      ramya    python     97
4     sravan    python     82
6     harsha  html/php     90
7      ramya  html/php     87
8     sravan  html/php     78
9   jyothika    php/js     89
10    harsha    php/js     93
11     ramya    php/js     94
---------------


##### Drop Rows that Contain Values in a List

In [33]:
# create dataframe with 3 columns
data = pd.DataFrame({
  
    "name": ['sravan', 'jyothika', 'harsha', 'ramya',
             'sravan', 'jyothika', 'harsha', 'ramya', 
             'sravan', 'jyothika', 'harsha', 'ramya'],
    "subjects": ['java', 'java', 'java', 'python', 
                 'python', 'python', 'html/php', 
                 'html/php', 'html/php', 'php/js', 
                 'php/js', 'php/js'],
    "marks": [98, 79, 89, 97, 82, 98, 90, 87,
              78, 89, 93, 94]
})
  
# consider the list
list1 = [98, 82, 79]
  
# drop rows from above list (~ negates the boolean mask)
print(data[~data.marks.isin(list1)])
  
print("---------------")
  
list2 = ['sravan', 'jyothika']
# drop rows from above list
print(data[~data.name.isin(list2)])

        name  subjects  marks
2     harsha      java     89
3      ramya    python     97
6     harsha  html/php     90
7      ramya  html/php     87
8     sravan  html/php     78
9   jyothika    php/js     89
10    harsha    php/js     93
11     ramya    php/js     94
---------------
      name  subjects  marks
2   harsha      java     89
3    ramya    python     97
6   harsha  html/php     90
7    ramya  html/php     87
10  harsha    php/js     93
11   ramya    php/js     94


##### Drop rows that contain specific values in multiple columns

In [34]:
data = pd.DataFrame({
  
    "name": ['sravan', 'jyothika', 'harsha', 'ramya',
             'sravan', 'jyothika', 'harsha', 'ramya',
             'sravan', 'jyothika', 'harsha', 'ramya'],
    "subjects": ['java', 'java', 'java', 'python',
                 'python', 'python', 'html/php', 
                 'html/php', 'html/php', 'php/js', 
                 'php/js', 'php/js'],
    "marks": [98, 79, 89, 97, 82, 98, 90,
              87, 78, 89, 93, 94]
})
  
# drop rows where marks is 98 or name is sravan
# (keeps rows where marks is not 98 AND name is not sravan)
print(data[(data.marks != 98) & (data.name != 'sravan')])
  
print("------------------")
  
# drop rows where marks is 98 and name is sravan
# (keeps rows where marks is not 98 OR name is not sravan)
print(data[(data.marks != 98) | (data.name != 'sravan')])

        name  subjects  marks
1   jyothika      java     79
2     harsha      java     89
3      ramya    python     97
6     harsha  html/php     90
7      ramya  html/php     87
9   jyothika    php/js     89
10    harsha    php/js     93
11     ramya    php/js     94
------------------
        name  subjects  marks
1   jyothika      java     79
2     harsha      java     89
3      ramya    python     97
4     sravan    python     82
5   jyothika    python     98
6     harsha  html/php     90
7      ramya  html/php     87
8     sravan  html/php     78
9   jyothika    php/js     89
10    harsha    php/js     93
11     ramya    php/js     94
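
The and/or logic is easy to get backwards when the conditions are negated one at a time. One way to keep it readable is to build the positive conditions first and negate the combined mask with ~. A sketch on a reduced, invented version of the data:

```python
import pandas as pd

# Hypothetical data, for illustration only
data = pd.DataFrame({'name': ['sravan', 'jyothika', 'sravan', 'ramya'],
                     'marks': [98, 98, 82, 97]})

is_98 = data['marks'] == 98
is_sravan = data['name'] == 'sravan'

# Drop rows where BOTH conditions hold
drop_if_both = data[~(is_98 & is_sravan)]

# Drop rows where EITHER condition holds
drop_if_either = data[~(is_98 | is_sravan)]

print(drop_if_both)
print(drop_if_either)
```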


##### Drop Non-Numeric Values

In [None]:
df = df[df['y'].str.isnumeric()].copy()
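
Note that str.isnumeric() only works on string columns and treats values such as '3.5' or '-2' as non-numeric. If those should be kept, one alternative is pd.to_numeric() with errors='coerce', sketched here on invented values for the same 'y' column:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'y': ['10', '3.5', 'abc', '-2', None]})

# Values that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(df['y'], errors='coerce')

# Keep only the rows that parsed, and store the parsed numbers
numeric_only = df[as_number.notna()].copy()
numeric_only['y'] = as_number[as_number.notna()]

print(numeric_only)
```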

#### Drop The Rows With NaNs

In [99]:
nums = {'Integers_1': [10, 15, 30, 40, 55, np.nan,
                       75, np.nan, 90, 150, np.nan],
        'Integers_2': [np.nan, 21, 22, 23, np.nan,
                       np.nan, 25, np.nan, 26, np.nan,
                       np.nan],
        'Integers_3': [21, np.nan, 22, np.nan, 23,
                       np.nan, np.nan, 25, 26, np.nan,
                       np.nan],
       }

df = pd.DataFrame(nums)

display(df)

# Drop Rows with Any NaN Values

print('\nDrop Rows with ANY NaN Values\n')
display(df.dropna())

# Drop Rows with ALL NaN Values 

print('\nDrop Rows with ALL NaN Values\n')
display(df.dropna(how = 'all'))

# Drop Rows Below a Non-NaN Threshold
# thresh=2 keeps only the rows with at least 2 non-NaN values

print('\nDrop Rows Above a Certain Threshold\n')
display(df.dropna(thresh=2))

# Drop Row with Nan Values in Specific Columns

print('\nDrop Row with Nan Values in Specific Columns\n')
display(df.dropna(subset = ['Integers_1','Integers_2']))

Unnamed: 0,Integers_1,Integers_2,Integers_3
0,10.0,,21.0
1,15.0,21.0,
2,30.0,22.0,22.0
3,40.0,23.0,
4,55.0,,23.0
5,,,
6,75.0,25.0,
7,,,25.0
8,90.0,26.0,26.0
9,150.0,,



Drop Rows with ANY NaN Values



Unnamed: 0,Integers_1,Integers_2,Integers_3
2,30.0,22.0,22.0
8,90.0,26.0,26.0



Drop Rows with ALL NaN Values



Unnamed: 0,Integers_1,Integers_2,Integers_3
0,10.0,,21.0
1,15.0,21.0,
2,30.0,22.0,22.0
3,40.0,23.0,
4,55.0,,23.0
6,75.0,25.0,
7,,,25.0
8,90.0,26.0,26.0
9,150.0,,



Drop Rows Above a Certain Threshold



Unnamed: 0,Integers_1,Integers_2,Integers_3
0,10.0,,21.0
1,15.0,21.0,
2,30.0,22.0,22.0
3,40.0,23.0,
4,55.0,,23.0
6,75.0,25.0,
8,90.0,26.0,26.0



Drop Row with Nan Values in Specific Columns



Unnamed: 0,Integers_1,Integers_2,Integers_3
1,15.0,21.0,
2,30.0,22.0,22.0
3,40.0,23.0,
6,75.0,25.0,
8,90.0,26.0,26.0
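
The thresh parameter counts non-NaN values, which is easy to misread. A small sketch on an invented three-row frame makes the rule visible:

```python
import numpy as np
import pandas as pd

# Row 0 has 3 non-NaN values, row 1 has 1, row 2 has none
df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [2, 5, np.nan],
                   'c': [3, np.nan, np.nan]})

# Keep rows with at least 2 non-NaN values
kept_2 = df.dropna(thresh=2)

# Keep rows with at least 1 non-NaN value
kept_1 = df.dropna(thresh=1)

print(kept_2)
print(kept_1)
```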


### How To Tell Differences / Similarities Between Two DataFrames

Creating Two DataFrames

In [35]:
# first dataframe
df1 = pd.DataFrame({
    'Age': ['20', '14', '56', '28', '10'],
    'Weight': [59, 29, 73, 56, 48]})
display(df1)
  
# second dataframe
df2 = pd.DataFrame({
    'Age': ['16', '20', '24', '40', '22'],
    'Weight': [55, 59, 73, 85, 56]})
display(df2)

Unnamed: 0,Age,Weight
0,20,59
1,14,29
2,56,73
3,28,56
4,10,48


Unnamed: 0,Age,Weight
0,16,55
1,20,59
2,24,73
3,40,85
4,22,56


#### Checking If Two Dataframes Are Exactly The Same

In [36]:
df1.equals(df2)

False

We can also check a particular column.

In [37]:
df2['Age'].equals(df1['Age'])

False

#### Finding the common rows between two DataFrames

We can use either the merge() function or the concat() function. 

    The merge() function serves as the entry point for all standard database join operations between DataFrame objects. With how='inner' it behaves like an SQL inner join, so when no key columns are given it joins on all shared columns and returns the rows common to both DataFrames. 

    The concat() function does all the heavy lifting of performing concatenation operations along an axis of pandas objects while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

##### Using merge function

In [39]:
df = df1.merge(df2, how = 'inner' ,indicator=False)
df

Unnamed: 0,Age,Weight
0,20,59


In [67]:
df = pd.merge(df1, df2, how = 'outer', indicator = True)
display(df)
display(df.query("_merge == 'both'"))

Unnamed: 0,Age,Weight,_merge
0,20,59,both
1,14,29,left_only
2,56,73,left_only
3,28,56,left_only
4,10,48,left_only
5,16,55,right_only
6,24,73,right_only
7,40,85,right_only
8,22,56,right_only


Unnamed: 0,Age,Weight,_merge
0,20,59,both


##### Using concat function

We add the second dataframe (df2) below the first dataframe (df1) by using the concat function. Then we group the new dataframe by all of its columns and check which groups contain more than one row. These are the common rows.

In [53]:
df = pd.concat([df1, df2])
  
df = df.reset_index(drop=True)
  
df_group = df.groupby(list(df.columns))
  
idx = [x[0] for x in df_group.groups.values() if len(x) > 1]
df.reindex(idx)

Unnamed: 0,Age,Weight
0,20,59


In [71]:
df1 == df2

Unnamed: 0,Age,Weight
0,False,False
1,False,False
2,False,True
3,False,False
4,False,False


#### Finding the uncommon rows between two DataFrames

For uncommon rows, we can use the concat function followed by drop_duplicates(keep=False), which removes every row that appears more than once. 

##### Using concat

In [63]:
pd.concat([df1,df2]).drop_duplicates(keep=False)

Unnamed: 0,Age,Weight
1,14,29
2,56,73
3,28,56
4,10,48
0,16,55
2,24,73
3,40,85
4,22,56


##### Using merge

In [70]:
df = pd.merge(df1, df2, how = 'outer', indicator = True)
display(df)
display(df.query("_merge != 'both'"))
display(df.query("_merge == 'left_only'"))
display(df.query("_merge == 'right_only'"))

Unnamed: 0,Age,Weight,_merge
0,20,59,both
1,14,29,left_only
2,56,73,left_only
3,28,56,left_only
4,10,48,left_only
5,16,55,right_only
6,24,73,right_only
7,40,85,right_only
8,22,56,right_only


Unnamed: 0,Age,Weight,_merge
1,14,29,left_only
2,56,73,left_only
3,28,56,left_only
4,10,48,left_only
5,16,55,right_only
6,24,73,right_only
7,40,85,right_only
8,22,56,right_only


Unnamed: 0,Age,Weight,_merge
1,14,29,left_only
2,56,73,left_only
3,28,56,left_only
4,10,48,left_only


Unnamed: 0,Age,Weight,_merge
5,16,55,right_only
6,24,73,right_only
7,40,85,right_only
8,22,56,right_only


### How To Combine Two String Columns In A DataFrame

First, let’s create an example DataFrame

In [41]:
df = pd.DataFrame(
  [
    (1, '2017', 10, 'Q1'),
    (2, '2017', 20, 'Q2'),
    (3, '2016', 35, 'Q4'),
    (4, '2019', 25, 'Q2'),
    (5, '2020', 44, 'Q3'),
    (6, '2021', 51, 'Q3'),
  ], 
  columns=['colA', 'colB', 'colC', 'colD']
)

display(df)

Unnamed: 0,colA,colB,colC,colD
0,1,2017,10,Q1
1,2,2017,20,Q2
2,3,2016,35,Q4
3,4,2019,25,Q2
4,5,2020,44,Q3
5,6,2021,51,Q3


#### Concatenating string columns in small datasets

For relatively small datasets (up to 100–150 rows) you can use pandas.Series.str.cat() method that is used to concatenate strings in the Series using the specified separator (by default the separator is set to '').

In [42]:
df['colE'] = df.colB.str.cat(df.colD) 

display(df)

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017Q1
1,2,2017,20,Q2,2017Q2
2,3,2016,35,Q4,2016Q4
3,4,2019,25,Q2,2019Q2
4,5,2020,44,Q3,2020Q3
5,6,2021,51,Q3,2021Q3


Now if we wanted to specify a separator that will be placed between the concatenated columns, then we simply need to pass sep argument:

In [43]:
df['colE'] = df.colB.str.cat(df.colD, sep='-')

df

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017-Q1
1,2,2017,20,Q2,2017-Q2
2,3,2016,35,Q4,2016-Q4
3,4,2019,25,Q2,2019-Q2
4,5,2020,44,Q3,2020-Q3
5,6,2021,51,Q3,2021-Q3


Alternatively, you can also use a list comprehension which is a bit more verbose but slightly faster:

In [46]:
df['colE'] = [''.join(i) for i in zip(df['colB'], df['colD'])]

df

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017Q1
1,2,2017,20,Q2,2017Q2
2,3,2016,35,Q4,2016Q4
3,4,2019,25,Q2,2019Q2
4,5,2020,44,Q3,2020Q3
5,6,2021,51,Q3,2021Q3


#### Concatenating string columns in larger datasets

Now if you are working with large datasets, the more efficient way to concatenate two columns is using the + operator.

In [47]:
df['colE'] = df['colB'] + df['colD']

df

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017Q1
1,2,2017,20,Q2,2017Q2
2,3,2016,35,Q4,2016Q4
3,4,2019,25,Q2,2019Q2
4,5,2020,44,Q3,2020Q3
5,6,2021,51,Q3,2021Q3


If you want to include a separator then simply place it as a string in-between the two columns:

In [48]:
df['colE'] = df['colB'] + '-' + df['colD']

df

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017-Q1
1,2,2017,20,Q2,2017-Q2
2,3,2016,35,Q4,2016-Q4
3,4,2019,25,Q2,2019-Q2
4,5,2020,44,Q3,2020-Q3
5,6,2021,51,Q3,2021-Q3


#### Concatenating string with non-string columns

In [49]:
df = pd.DataFrame(
  [
    (1, 2017, 10, 'Q1'),
    (2, 2017, 20, 'Q2'),
    (3, 2016, 35, 'Q4'),
    (4, 2019, 25, 'Q2'),
    (5, 2020, 44, 'Q3'),
    (6, 2021, 51, 'Q3'),
  ], 
  columns=['colA', 'colB', 'colC', 'colD']
)

print(df.dtypes)

colA     int64
colB     int64
colC     int64
colD    object
dtype: object


In this case, you can simply cast the column to string using the pandas.Series.astype() or pandas.Series.map() methods.

In [50]:
# Option 1
df['colE'] = df.colB.astype(str).str.cat(df.colD)
display(df)

# Option 2
df['colE'] = df['colB'].astype(str) + '-' + df['colD']
display(df)

# Option 3
df['colE'] = [
    ''.join(i) for i in zip(df['colB'].map(str), df['colD'])
]
display(df)

Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017Q1
1,2,2017,20,Q2,2017Q2
2,3,2016,35,Q4,2016Q4
3,4,2019,25,Q2,2019Q2
4,5,2020,44,Q3,2020Q3
5,6,2021,51,Q3,2021Q3


Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017-Q1
1,2,2017,20,Q2,2017-Q2
2,3,2016,35,Q4,2016-Q4
3,4,2019,25,Q2,2019-Q2
4,5,2020,44,Q3,2020-Q3
5,6,2021,51,Q3,2021-Q3


Unnamed: 0,colA,colB,colC,colD,colE
0,1,2017,10,Q1,2017Q1
1,2,2017,20,Q2,2017Q2
2,3,2016,35,Q4,2016Q4
3,4,2019,25,Q2,2019Q2
4,5,2020,44,Q3,2020Q3
5,6,2021,51,Q3,2021Q3


### How to Filter A Pandas Dataframe By A List of Values

In [109]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date':pd.date_range(start='2021-12-01', periods=10, freq='MS'),
                   'country':['USA','India','Germany','France','Canada','Netherland',
                              'UK','Singapore', 'Australia', 'Canada'],
                   'a': np.random.randint(10, size=10),
                   'b': np.random.randint(10, size=10)})

df

Unnamed: 0,date,country,a,b
0,2021-12-01,USA,1,4
1,2022-01-01,India,5,9
2,2022-02-01,Germany,0,3
3,2022-03-01,France,2,7
4,2022-04-01,Canada,9,8
5,2022-05-01,Netherland,0,3
6,2022-06-01,UK,4,4
7,2022-07-01,Singapore,1,2
8,2022-08-01,Australia,6,3
9,2022-09-01,Canada,7,6


#### Filter By Using A Boolean Index

Below, the '|' is the boolean operator for 'or'. You would use '&' for 'and'.

In [111]:
df[(df['country'] == 'Canada') | (df['country'] == 'USA')]

Unnamed: 0,date,country,a,b
0,2021-12-01,USA,1,4
4,2022-04-01,Canada,9,8
9,2022-09-01,Canada,7,6


#### Filter By Using Pandas isin() Method On A List

In [112]:
df[df['country'].isin(['Canada', 'USA', 'India'])]

Unnamed: 0,date,country,a,b
0,2021-12-01,USA,1,4
1,2022-01-01,India,5,9
4,2022-04-01,Canada,9,8
9,2022-09-01,Canada,7,6


In [113]:
df[df['a'].isin([2,3,4,5])]

Unnamed: 0,date,country,a,b
1,2022-01-01,India,5,9
3,2022-03-01,France,2,7
6,2022-06-01,UK,4,4


#### Filter By Using Pandas query() Method 

Finally, inside a query() string expression, membership is written with the keyword in rather than by calling isin():

In [116]:
df.query("country in ['Canada','USA','India']")

Unnamed: 0,date,country,a,b
0,2021-12-01,USA,1,4
1,2022-01-01,India,5,9
4,2022-04-01,Canada,9,8
9,2022-09-01,Canada,7,6


In [117]:
df.query("a in [2,3,4,5]")

Unnamed: 0,date,country,a,b
1,2022-01-01,India,5,9
3,2022-03-01,France,2,7
6,2022-06-01,UK,4,4


### How to check for a 'float' NaN

In [137]:
test_1 = np.nan
test_2 = np.nan
test_3 = 4
test_4 = 'four'

print(test_1)
print(type(test_1))

nan
<class 'float'>


#### Using math.isnan()

In [138]:
#These two work
print(math.isnan(test_1))
print(math.isnan(test_3))

#But a string will raise a TypeError
print(math.isnan(test_4))

True
False


TypeError: must be real number, not str

In [135]:
print(test_1 == np.nan)
print(test_1 == test_2)

False
False
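
Both comparisons are False because a floating-point NaN never compares equal to anything, including itself. This is why an equality test cannot be used to detect NaN. A short sketch of the behaviour:

```python
import math
import numpy as np

x = np.nan

# NaN is not equal to itself
equal_to_self = (x == x)
differs_from_self = (x != x)

print(equal_to_self)
print(differs_from_self)
print(math.isnan(x))
```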


#### Using pd.isna()

Nicer than math.isnan(), as a string will not raise a TypeError

In [142]:
print(pd.isna(test_1))
print(pd.isna(test_3))
print(pd.isna(test_4))

True
False
False
