# **Getting Started with Pandas**

I'm **Nir Bahadur Raya**, writing this on February 27, 2023.

This notebook is a compilation of all the concepts related to Pandas that I learned and revised. The purpose of this notebook is to serve as a revision guide for me to review and solidify my understanding of Pandas. It can also serve as a quick reference guide for anyone looking to learn or refresh their knowledge of Pandas.

Pandas is a popular open-source data manipulation library for Python that provides powerful data structures for efficient data analysis and cleaning. It offers flexible tools to handle data in various formats such as CSV, Excel, SQL databases, and others. With Pandas, you can easily filter, transform, aggregate, and merge data to perform statistical analysis, machine learning, and other data-intensive tasks. 

**1. Installing Pandas**

Before we can start using Pandas, we need to install it. You can install Pandas using pip, the package installer for Python, by running the following command in your terminal:

In [1]:
# pip install pandas


Pandas is pre-installed in Google Colab, so you don't need to install it again using pip. You can import it in your Colab notebook and start using it right away. 

When you import Pandas, it is conventional to give it the alias 'pd', so that you can type 'pd' instead of 'pandas' each time you use one of its functions or objects:

In [2]:
import pandas as pd
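
A quick way to confirm that the import worked is to print the installed version. This is only a small sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed Pandas version to confirm the import succeeded.
print(pd.__version__)
```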

**2. Creating a Pandas DataFrame**

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. You can create a DataFrame in Pandas using a variety of methods. Here are a few examples:

**Method 1: From a dictionary**

You can create a DataFrame from a dictionary, where the keys are the column names and the values are the data. For example:

In [3]:
data = {'name': ['John', 'Mary', 'Peter', 'Jane'], 'age': [25, 30, 27, 21], 'city': ['New York', 'Paris', 'London', 'Sydney']}

df = pd.DataFrame(data)

print(df)


    name  age      city
0   John   25  New York
1   Mary   30     Paris
2  Peter   27    London
3   Jane   21    Sydney


**Method 2: From a list of lists**

You can also create a DataFrame from a list of lists. Each inner list represents a row in the DataFrame. For example:

In [4]:
data = [['John', 25, 'New York'], ['Mary', 30, 'Paris'], ['Peter', 27, 'London'], ['Jane', 21, 'Sydney']]

df = pd.DataFrame(data, columns=['name', 'age', 'city'])

print(df)


    name  age      city
0   John   25  New York
1   Mary   30     Paris
2  Peter   27    London
3   Jane   21    Sydney


**Method 3: From a CSV file**

You can also create a DataFrame from a CSV file using the read_csv() function. For example:

In [5]:
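df = pd.read_csv('data1.csv')
# Note: data1.csv must be in the current working directory (otherwise pass its full path).
# Each comma in my file is followed by a space, so the column names after the first
# keep a leading space (for example ' Age'); skipinitialspace=True would strip it.
""" In my case, data1.csv contains the following data:
Name, Age, Gender, Occupation
John, 32, Male, Engineer
Sarah, 28, Female, Doctor
Tom, 45, Male, Lawyer
Emily, 22, Female, Student
Chris, 39, Male, Salesman
"""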

print(df)


    Name   Age   Gender  Occupation
0   John    32     Male    Engineer
1  Sarah    28   Female      Doctor
2    Tom    45     Male      Lawyer
3  Emily    22   Female     Student
4  Chris    39     Male    Salesman
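
If you don't have a CSV file at hand, the same call can be tried on text held in memory. The sketch below is an illustration under assumptions: `csv_text` is a hypothetical two-row stand-in for data1.csv, and `io.StringIO` is used only so the example runs without a file. It also shows the effect of the spaces after the commas on the column names.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data1.csv (first two rows only).
csv_text = """Name, Age, Gender, Occupation
John, 32, Male, Engineer
Sarah, 28, Female, Doctor
"""

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# skipinitialspace=True strips the space after each delimiter.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(clean.columns.tolist())
```

With the default settings you would have to write `raw[' Age']`, including the space, whereas `clean['Age']` works as expected.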


**3. Viewing Data**

Once you have created a DataFrame, you may want to view the data. There are several methods you can use to do this:

**head() and tail()**

The head() and tail() methods return the first and last few rows of the DataFrame, respectively (five rows by default). For example:

In [6]:
data = {'name': ['John', 'Mary', 'Peter', 'Jane'], 'age': [25, 30, 27, 21], 'city': ['New York', 'Paris', 'London', 'Sydney']}

df = pd.DataFrame(data)

print(df.head(2)) # view the first two rows
print(df.tail(2)) # view the last two rows

   name  age      city
0  John   25  New York
1  Mary   30     Paris
    name  age    city
2  Peter   27  London
3   Jane   21  Sydney
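
Besides head() and tail(), a few other attributes and methods give a quick overview of a DataFrame. The sketch below rebuilds the same small DataFrame so it can run on its own; the exact text printed by dtypes and info() varies between Pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter', 'Jane'],
    'age': [25, 30, 27, 21],
    'city': ['New York', 'Paris', 'London', 'Sydney'],
})

print(df.shape)        # (number of rows, number of columns)
print(df.dtypes)       # data type of each column
df.info()              # column names, non-null counts, memory usage
summary = df.describe()  # summary statistics for the numeric columns
print(summary)
```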


**4. Selecting Data**

Once you have a DataFrame, you may want to select certain rows or columns to work with. Here are a few methods you can use to do this:

**a) Selecting Columns**

You can select a single column from a DataFrame by using the column name as an index. For example:

In [7]:
print(df['name'])

0     John
1     Mary
2    Peter
3     Jane
Name: name, dtype: object


You can select multiple columns by passing a list of column names. For example:

In [8]:
print(df[['name', 'city']])


    name      city
0   John  New York
1   Mary     Paris
2  Peter    London
3   Jane    Sydney
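
One detail worth noticing in the two selections above is the type of the result: a single column name gives back a Series, while a list of names gives back a DataFrame, even when the list holds only one name. A minimal sketch, rebuilding the same DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter', 'Jane'],
    'age': [25, 30, 27, 21],
    'city': ['New York', 'Paris', 'London', 'Sydney'],
})

col = df['name']      # single brackets: a one-dimensional Series
frame = df[['name']]  # double brackets: a DataFrame with one column

print(type(col).__name__)
print(type(frame).__name__)
print(frame.shape)
```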


**b) Selecting Rows**

You can select a single row by its integer position (counting from 0) using the iloc indexer. For example:

In [9]:
print(df.iloc[1])

name     Mary
age        30
city    Paris
Name: 1, dtype: object


You can select multiple rows by passing a list of row positions to iloc. For example:

In [10]:
print(df.iloc[[0,2]])

    name  age      city
0   John   25  New York
2  Peter   27    London
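
iloc also accepts slices and negative positions, following the usual Python rules. A short sketch, again rebuilding the DataFrame so it is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter', 'Jane'],
    'age': [25, 30, 27, 21],
    'city': ['New York', 'Paris', 'London', 'Sydney'],
})

middle = df.iloc[1:3]  # rows at positions 1 and 2 (the end position is excluded)
last = df.iloc[-1]     # the last row, returned as a Series

print(middle)
print(last)
```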


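**c) Selecting Rows and Columns**

You can select specific rows and columns together using the loc indexer, which takes row labels and column names. Here the row labels are the default index values 0 to 3, so they happen to coincide with the row positions. For example: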

In [11]:
print(df.loc[[1,3], ['name', 'city']])

   name    city
1  Mary   Paris
3  Jane  Sydney
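
The difference between loc (labels) and iloc (positions) becomes visible once the index is not the default 0, 1, 2, and so on. The sketch below assumes a hypothetical index of letters, chosen only for illustration, and selects the same cells both ways.

```python
import pandas as pd

# The letter index is an assumption made for this illustration.
df = pd.DataFrame(
    {
        'name': ['John', 'Mary', 'Peter', 'Jane'],
        'age': [25, 30, 27, 21],
        'city': ['New York', 'Paris', 'London', 'Sydney'],
    },
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc[['b', 'd'], ['name', 'city']]  # row labels and column names
by_position = df.iloc[[1, 3], [0, 2]]            # row and column positions

print(by_label)
print(by_label.equals(by_position))
```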


**5. Data Cleaning**

Pandas provides several methods to clean and preprocess data. Here are a few common methods:

**a) Removing Duplicates**

You can remove duplicate rows from a DataFrame using the drop_duplicates() method. For example:

In [12]:
data = {'name': ['John', 'Mary', 'Peter', 'Mary'], 'age': [25, 30, 27, 30], 'city': ['New York', 'Paris', 'London', 'Paris']}

df = pd.DataFrame(data)

print(df)

# Remove duplicate rows
df = df.drop_duplicates()

print(df)

    name  age      city
0   John   25  New York
1   Mary   30     Paris
2  Peter   27    London
3   Mary   30     Paris
    name  age      city
0   John   25  New York
1   Mary   30     Paris
2  Peter   27    London
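
By default, drop_duplicates() only treats rows as duplicates when every column matches. Its subset and keep parameters let you compare on selected columns and choose which copy survives. In the sketch below the data is slightly altered from the example above (the second 'Mary' row has a different age) to make the difference visible.

```python
import pandas as pd

# Modified data for illustration: the two 'Mary' rows differ in age.
df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter', 'Mary'],
    'age': [25, 30, 27, 31],
    'city': ['New York', 'Paris', 'London', 'Paris'],
})

# No row is an exact copy of another, so nothing is removed.
print(len(df.drop_duplicates()))

# Compare on 'name' only and keep the last occurrence of each name.
by_name = df.drop_duplicates(subset=['name'], keep='last')
print(by_name)

# Optionally renumber the rows after dropping.
deduped = by_name.reset_index(drop=True)
print(deduped)
```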


**b) Handling Missing Values**

Pandas provides several methods for handling missing or null values in a DataFrame. Here are a few common methods:

**i) Checking for Missing Values**

You can locate missing values using the isnull() method, which returns a DataFrame of the same shape containing True wherever a value is missing. For example:

In [13]:
import numpy as np

data = {'name': ['John', 'Mary', np.nan, 'Peter'], 'age': [25, 30, 27, np.nan], 'city': ['New York', 'Paris', 'London', np.nan]}

df = pd.DataFrame(data)

print(df)

# Check for missing values
print(df.isnull())

    name   age      city
0   John  25.0  New York
1   Mary  30.0     Paris
2    NaN  27.0    London
3  Peter   NaN       NaN
    name    age   city
0  False  False  False
1  False  False  False
2   True  False  False
3  False   True   True
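
On a larger DataFrame the True/False table is hard to read, so it is common to summarise it. The sketch below counts the missing values per column and picks out the rows that contain at least one missing value, using the same data as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', np.nan, 'Peter'],
    'age': [25, 30, 27, np.nan],
    'city': ['New York', 'Paris', 'London', np.nan],
})

# True counts as 1 when summed, so this gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Does the DataFrame contain any missing value at all?
has_missing = bool(df.isnull().any().any())
print(has_missing)

# Rows with at least one missing value.
incomplete_rows = df[df.isnull().any(axis=1)]
print(incomplete_rows)
```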


**ii) Dropping Missing Values**

You can drop rows with missing values from a DataFrame using the dropna() method. For example:

In [14]:
print(df)

# Drop rows that contain at least one missing value
df = df.dropna()

print(df)

    name   age      city
0   John  25.0  New York
1   Mary  30.0     Paris
2    NaN  27.0    London
3  Peter   NaN       NaN
   name   age      city
0  John  25.0  New York
1  Mary  30.0     Paris
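
Dropping every row with any missing value can discard a lot of data. The subset and thresh parameters of dropna() give finer control; the sketch below applies them to the same data as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', np.nan, 'Peter'],
    'age': [25, 30, 27, np.nan],
    'city': ['New York', 'Paris', 'London', np.nan],
})

# Drop a row only if 'name' is missing; gaps in other columns are tolerated.
named = df.dropna(subset=['name'])
print(named)

# Keep rows that have at least two non-missing values.
mostly_complete = df.dropna(thresh=2)
print(mostly_complete)
```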


**iii) Filling Missing Values**

In Pandas, you can fill missing values using the fillna() method, which accepts a constant value (or a dictionary of per-column values). Related methods cover other strategies: ffill() and bfill() propagate the previous or next valid value, and interpolate() estimates missing numeric values from their neighbours.

**Example 1**: Filling missing values with a constant value

In [15]:
data = {'name': ['John', 'Mary', np.nan, 'Peter'], 'age': [25, 30, 27, np.nan], 'city': ['New York', 'Paris', 'London', np.nan]}

df = pd.DataFrame(data)
# fill missing values with a constant value of 0
df.fillna(0, inplace=True)
print(df)

    name   age      city
0   John  25.0  New York
1   Mary  30.0     Paris
2      0  27.0    London
3  Peter   0.0         0
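
Filling every gap with 0 puts a number into the text columns as well, as the output above shows. One alternative is to pass fillna() a dictionary with a separate fill value per column. The fill values below ('Unknown' and the mean age) are assumptions chosen for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', np.nan, 'Peter'],
    'age': [25, 30, 27, np.nan],
    'city': ['New York', 'Paris', 'London', np.nan],
})

# One fill value per column: text for text columns, the column mean for 'age'.
fill_values = {
    'name': 'Unknown',
    'age': df['age'].mean(),  # mean() skips missing values by default
    'city': 'Unknown',
}
filled = df.fillna(fill_values)
print(filled)
```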


**Example 2**: Filling missing values with forward fill

In [16]:
data = {'name': ['John', 'Mary', np.nan, 'Peter'], 'age': [25, 30, 27, np.nan], 'city': ['New York', 'Paris', 'London', np.nan]}

df = pd.DataFrame(data)
# fill each missing value with the last valid value above it (forward fill);
# fillna(method='ffill') is deprecated in recent Pandas versions in favour of ffill()
df = df.ffill()
print(df)

    name   age      city
0   John  25.0  New York
1   Mary  30.0     Paris
2   Mary  27.0    London
3  Peter  27.0    London
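
Backward fill works the same way in the opposite direction: each gap takes the next valid value below it. A minimal sketch on the same data, which also shows that a gap in the last row stays missing because nothing follows it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', np.nan, 'Peter'],
    'age': [25, 30, 27, np.nan],
    'city': ['New York', 'Paris', 'London', np.nan],
})

# Fill each missing value with the next valid value below it.
filled_back = df.bfill()
print(filled_back)
```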


**Example 3**: Filling missing values using interpolation

In [17]:
# create the DataFrame
data = {'name': ['John', 'Mary', np.nan, 'Peter'], 'age': [25, 30, 27, np.nan], 'city': ['New York', 'Paris', 'London', np.nan]}
df = pd.DataFrame(data)

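# fill missing values in the numeric 'age' column using linear interpolation;
# interpolation does not apply to text columns such as 'name' and 'city', and a
# trailing gap has no later value to interpolate towards, so it takes the last valid value
df['age'] = df['age'].interpolate(method='linear')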
print(df)


    name   age      city
0   John  25.0  New York
1   Mary  30.0     Paris
2    NaN  27.0    London
3  Peter  27.0       NaN
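
In the output above, the only numeric gap is in the last row, so it simply repeats the previous age, and the text columns keep their missing values. Linear interpolation is easier to see when the gaps lie between two known values. The Series below is a made-up example for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with two gaps between known values.
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Each gap is placed on the straight line between 10 and 40.
interpolated = s.interpolate(method='linear')
print(interpolated.tolist())
```

The two gaps are filled with evenly spaced values between the known endpoints.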


This notebook covers a set of basic Pandas concepts and examples. It should be enough to get you started, but it is not a complete course on the topic.