## What is Pandas?

Pandas is a popular open-source **data manipulation and analysis** library for Python. It provides data structures for efficiently storing and manipulating large datasets, along with functions for reading and writing data in different file formats. The primary data structures in pandas are:

1. **Series:** One-dimensional labeled array capable of holding any data type. It is similar to a column in a spreadsheet or a single column in a DataFrame.

2. **DataFrame:** A two-dimensional table with labeled axes (rows and columns). It is the primary data structure used in pandas and can be thought of as a container for Series objects.
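To make the relationship between the two structures concrete, here is a minimal sketch showing that selecting one column of a DataFrame returns a Series. The column names and values are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

# Illustrative data; the names and values are made up for this sketch
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Selecting a single column returns a Series that shares the DataFrame's row index
age = df['Age']
print(type(age).__name__)          # Series
print(age.index.equals(df.index))  # True
```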

Some key features and functionalities of pandas include:

- **Integration with NumPy:** Pandas is built on top of the NumPy library, which provides high-performance numerical operations. This integration allows for seamless interaction between NumPy and pandas.

- **Data I/O:** Pandas reads from and writes to many data sources, including CSV files, Excel files, SQL databases, and more, making it easy to import and export data.

- **Data Exploration:** It allows for easy data exploration and manipulation, such as filtering, grouping, and aggregating data.

- **Data Cleaning:** Pandas provides functions to handle missing data, duplicate values, and other common data cleaning tasks.

- **Time Series Data:** It has robust support for working with time series data, making it suitable for analyzing temporal data.
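A few of the features listed above can be sketched in a handful of lines. The `sales` table, its column names, and its values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing amount
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'amount': [100.0, np.nan, 150.0, 200.0, 100.0],
})

# Data cleaning: replace the missing amount with 0
cleaned = sales.fillna({'amount': 0.0})

# Data exploration: filter rows, then group and aggregate
large = cleaned[cleaned['amount'] >= 100]
totals = cleaned.groupby('region')['amount'].sum()
print(totals)

# NumPy integration: a NumPy function applied directly to a column
roots = np.sqrt(cleaned['amount'])
print(type(roots).__name__)  # Series
```

Filling missing values with 0 is only one possible choice; depending on the analysis, dropping the rows with `dropna()` may be more appropriate.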



## Creating Series

A Series is a one-dimensional, array-like object that holds a sequence of values together with associated labels, called the index. All items in a Series share a single data type (dtype). Here are several ways to create a Series in pandas:

In [10]:
#1. From a List: You can create a Series from a Python list.
import pandas as pd
data_list = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data_list)
print(series_from_list)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [11]:
#2. From a NumPy Array: Pandas Series can be created from a NumPy array.
import pandas as pd
import numpy as np
data_array = np.array([1, 2, 3, 4, 5])
series_from_array = pd.Series(data_array)
print(series_from_array)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [12]:
# 3. From a Dictionary: Keys of the dictionary become the index of the Series, and values become the data.
import pandas as pd
data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

a    1
b    2
c    3
d    4
e    5
dtype: int64
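Because a Series has a single dtype, pandas picks one type that can represent every value passed in. The sketch below, with made-up values, shows how the dtype changes as the input becomes more mixed.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numbers = pd.Series([1, 2.5, 3])    # integers are upcast to floats
mixed_types = pd.Series([1, 'two', 3.0])  # falls back to generic Python objects

print(ints.dtype)           # an integer dtype (int64 on most platforms)
print(mixed_numbers.dtype)  # float64
print(mixed_types.dtype)    # object
```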


**Specifying Index:**

When you create a pandas Series from a list-like object, by default, it is assigned a numerical index. However, if you wish to customize the index labels for better identification, you can do so by using the `index` parameter. This parameter allows you to explicitly specify the index labels you want associated with each element in the Series.

For instance, consider the following code:

In [4]:
import pandas as pd

# Sample data
data = [10, 20, 30, 40, 50]

# Default Series with numerical index
default_series = pd.Series(data)
print("Default Series:")
print(default_series)
# Creating a Series with a custom index
custom_index = ['a', 'b', 'c', 'd', 'e']
#                             data set       the index
series_with_index = pd.Series(data, index=custom_index)
print("\nSeries with Custom Index:")
print(series_with_index)

Default Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series with Custom Index:
a    10
b    20
c    30
d    40
e    50
dtype: int64


In `default_series`, pandas assigns the default integer index (0 through 4). In `series_with_index`, the `index` parameter supplies a custom index, so each element is associated with a label ('a', 'b', 'c', 'd', 'e') for easier reference and interpretation.
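Two practical consequences of a custom index are worth seeing in code: values can be looked up by label, and the index must supply exactly one label per value. This is a small sketch with illustrative data.

```python
import pandas as pd

data = [10, 20, 30]

# Labels make lookups self-describing
s = pd.Series(data, index=['a', 'b', 'c'])
print(s['b'])  # 20

# The index needs one label per value; a length mismatch raises ValueError
try:
    pd.Series(data, index=['a', 'b'])
except ValueError as exc:
    print("ValueError:", exc)
```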

## Creating DataFrame

A pandas DataFrame is a two-dimensional data structure with labeled rows and columns. It is similar to a Google Sheets or Excel worksheet with more than one column.

Here are several ways you can create a DataFrame in pandas:

In [71]:
#1. From a Dictionary of Lists: You can create a DataFrame from a dictionary where keys are column names and values are lists.
import pandas as pd

data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
              'Age': [25, 30, 35],
              'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data_dict)
df

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


In [72]:
#2. From a List of Lists: Create a DataFrame directly from a list of lists. The inner lists represent rows.
import pandas as pd

data_list = [['Alice', 25, 'New York'],
              ['Bob', 30, 'San Francisco'],
              ['Charlie', 35, 'Los Angeles']]

df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
df

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


In [73]:
#3. From a List of Dictionaries: If your data is in the form of a list of dictionaries, each dictionary represents a row.
import pandas as pd

data_list_of_dicts = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
                       {'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
                       {'Name': 'Charlie', 'Age': 35, 'City': 'Los Angeles'}]

df = pd.DataFrame(data_list_of_dicts)
df

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


In [74]:
#4. From a NumPy Array: You can create a DataFrame from a NumPy array and specify column names.
import pandas as pd
import numpy as np
# A NumPy array stores same-typed values more compactly than nested Python lists
data_array = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])

df = pd.DataFrame(data_array, columns=['A', 'B', 'C'])
df

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


In [76]:
#5. From a CSV File: Read data from a CSV file and create a DataFrame. (Download the Iris dataset: https://www.kaggle.com/datasets/uciml/iris?resource=download)
# To upload the file from your local drive in Google Colab, run the following code in a cell
from google.colab import files
uploaded = files.upload()
# Click "Choose Files", then select the CSV file on your local drive to upload it. Then use the following snippet to load it into a pandas DataFrame.

In [77]:
import pandas as pd
# Provide the correct path to the CSV file
file_path = '/content/Iris.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the DataFrame
df

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
0      1            5.1           3.5            1.4           0.2     Iris-setosa
1      2            4.9           3.0            1.4           0.2     Iris-setosa
2      3            4.7           3.2            1.3           0.2     Iris-setosa
3      4            4.6           3.1            1.5           0.2     Iris-setosa
4      5            5.0           3.6            1.4           0.2     Iris-setosa
...  ...            ...           ...            ...           ...             ...
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica
148  149            6.2           3.4            5.4           2.3  Iris-virginica
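The `google.colab` upload step only works inside Colab. As a self-contained alternative for experimenting with `pd.read_csv`, the sketch below reads CSV text from memory; the three data rows are copied from the first rows of the output above, and using `Id` as the row index is an optional choice made for this example.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file (first three rows of the Iris data)
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a file path or any file-like object;
# index_col uses the Id column as row labels instead of the default 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), index_col='Id')
print(df.head(2))
print(df.shape)  # (3, 5)
```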


## Accessing Items of a Series

In [45]:
import pandas as pd
# 1. Accessing by index label
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
item_a = series['a']
print("1. Accessing by index label:", item_a)  # Output: 10

# 2. Accessing by integer position using iloc
item_0 = series.iloc[0]
print("2. Accessing by integer position:", item_0)  # Output: 10

# 3. Slicing by integer positions
sliced_series = series.iloc[1:3]  # iloc makes the positional slice explicit
print("3. Slicing by integer positions:\n", sliced_series)  # Output: b    20, c    30

# 4. Boolean indexing
condition = series > 15
filtered_series = series[condition]
print("4. Boolean indexing:\n", filtered_series)  # Output: b    20, c    30

# 5. Fancy indexing using a list of labels
items = series[['a', 'c']]
print("5. Fancy indexing:\n", items)  # Output: a    10, c    30


1. Accessing by index label: 10
2. Accessing by integer position: 10
3. Slicing by integer positions:
 b    20
c    30
dtype: int64
4. Boolean indexing:
 b    20
c    30
dtype: int64
5. Fancy indexing:
 a    10
c    30
dtype: int64
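The difference between label-based and position-based access matters most when the index itself contains integers. The following sketch uses made-up values to show how `loc` and `iloc` can return different items for the same key, and how their slices treat the end point differently.

```python
import pandas as pd

# Integer labels that deliberately do not match positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

print(s.loc[0])   # 20 -> the value whose label is 0
print(s.iloc[0])  # 10 -> the value at position 0

# Label slices include the end label; position slices exclude the end position
t = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(t.loc['a':'b'].tolist())  # [10, 20]
print(t.iloc[0:1].tolist())     # [10]
```

Using `loc` and `iloc` explicitly, rather than plain brackets, avoids any ambiguity about which kind of lookup is intended.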


## Accessing Columns of a DataFrame

In [67]:
import pandas as pd
# Sample DataFrame
data = {
    'column_name': [10, 20, 30],
    'column_name1': [1.1, 2.2, 3.3],
    'column_name2': ['A', 'B', 'C'],
    'partial_column_name_extra': [100, 200, 300]
}

df = pd.DataFrame(data)
#print("Sample DataFrame:\n", df)
df

   column_name  column_name1 column_name2  partial_column_name_extra
0           10           1.1            A                        100
1           20           2.2            B                        200
2           30           3.3            C                        300


In [69]:
# 1. Using Bracket Notation:
# Selecting a single column
column_data = df['column_name']
print("\n 1. Using Bracket Notation - Single Column:\n", column_data)

# Selecting multiple columns
selected_columns = df[['column_name1', 'column_name2']]
print("\n 1. Using Bracket Notation - Multiple Columns:\n", selected_columns)

# 2. Using Dot Notation (if column names are valid Python identifiers):
# Selecting a single column
column_data = df.column_name
print("\n 2. Using Dot Notation:\n", column_data)
# Note: This method is not suitable if column names have spaces or special characters.

# 3. Filtering Columns by Name:
# Selecting columns with names containing a substring
# like='partial_column_name' selects every column whose name contains that substring
selected_columns = df.filter(like='partial_column_name')
print("\n 3. Filtering Columns by Name:\n", selected_columns)

# 4. Selecting Columns by Data Type:
# Selecting columns of a specific data type (e.g., numerical columns)
selected_columns = df.select_dtypes(include='number')
print("\n 4. Selecting Columns by Data Type:\n", selected_columns)



 1. Using Bracket Notation - Single Column:
 0    10
1    20
2    30
Name: column_name, dtype: int64

 1. Using Bracket Notation - Multiple Columns:
    column_name1 column_name2
0           1.1            A
1           2.2            B
2           3.3            C

 2. Using Dot Notation:
 0    10
1    20
2    30
Name: column_name, dtype: int64

 3. Filtering Columns by Name:
    partial_column_name_extra
0                        100
1                        200
2                        300

 4. Selecting Columns by Data Type:
    column_name  column_name1  partial_column_name_extra
0           10           1.1                        100
1           20           2.2                        200
2           30           3.3                        300
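Dot notation is convenient but has limits beyond spaces and special characters. The sketch below uses a hypothetical DataFrame whose column names were chosen to trigger those limits: one name collides with a DataFrame method, and one contains a space.

```python
import pandas as pd

# Hypothetical columns chosen to show where dot notation breaks down
df = pd.DataFrame({'count': [1, 2, 3], 'unit price': [9.5, 4.0, 2.5]})

# df.count is the DataFrame method, not the 'count' column
print(callable(df.count))      # True
print(df['count'].tolist())    # [1, 2, 3]

# A name containing a space can only be reached with brackets
print(df['unit price'].sum())  # 16.0

# New columns should be created with brackets as well
df['total'] = df['count'] * df['unit price']
print(df.columns.tolist())     # ['count', 'unit price', 'total']
```

Bracket notation works for every column name, which makes it the safer default in reusable code.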


## Accessing Rows of a DataFrame

In [64]:
import pandas as pd

# Sample DataFrame creation with index labels
data = {
    'column_name': [10, 60, 30, 80, 50],
    'column_name1': [1.1, 2.2, 3.3, 4.4, 5.5],
    'column_name2': ['A', 'B', 'C', 'D', 'E']
}
df = pd.DataFrame(data, index=['row0', 'row1', 'row2', 'row3', 'row4'])
#print("Sample DataFrame:\n", df)
df

      column_name  column_name1 column_name2
row0           10           1.1            A
row1           60           2.2            B
row2           30           3.3            C
row3           80           4.4            D
row4           50           5.5            E


In [66]:
# 1. Integer Indexing (iloc): Select a specific item by providing integer positions.

# Selecting a single row by index
row = df.iloc[2]
print("\n1. Selecting a single row by index:\n", row)  # Output: row2 values

# Selecting multiple rows by index
rows = df.iloc[2:5]  # selects rows 2 through 4
print("\n 1. Selecting multiple rows by index:\n", rows)

# 2. Label Indexing (loc): Select a specific item by providing row label.

# Selecting a single row by label
row = df.loc['row2']
print("\n2. Selecting a single row by label:\n", row)  # Output: row2 values

# Selecting multiple rows by label
rows = df.loc['row2':'row4']
print("\n2. Selecting multiple rows by label:\n", rows)

# 3. Conditional Selection: Select items that satisfy a condition.

# Selecting rows based on a condition
condition = df['column_name'] > 50
selected_rows = df[condition]
print("\n3. Selecting rows based on a condition:\n", selected_rows)

# Combining conditions using logical operators
combined_condition = (df['column_name'] > 20) & (df['column_name1'] < 5)
combined_selected_rows = df[combined_condition]
print("\n3. Combining conditions with & (and):\n", combined_selected_rows)
#You can combine conditions using logical operators like & (and), | (or), and ~ (not).

# 4. Using the query Method:

# Selecting rows using a query string
selected_rows = df.query('column_name > 50')
print("\n4. Selecting rows using a query string:\n", selected_rows)

# 5. Selecting Rows with Specific Values of a column:

# Selecting rows with specific values in a column
selected_rows = df[df['column_name'].isin([30, 80])]
print("\n5. Selecting rows with specific values in a column:\n", selected_rows)



1. Selecting a single row by index:
 column_name      30
column_name1    3.3
column_name2      C
Name: row2, dtype: object

 1. Selecting multiple rows by index:
       column_name  column_name1 column_name2
row2           30           3.3            C
row3           80           4.4            D
row4           50           5.5            E

2. Selecting a single row by label:
 column_name      30
column_name1    3.3
column_name2      C
Name: row2, dtype: object

2. Selecting multiple rows by label:
       column_name  column_name1 column_name2
row2           30           3.3            C
row3           80           4.4            D
row4           50           5.5            E

3. Selecting rows based on a condition:
       column_name  column_name1 column_name2
row1           60           2.2            B
row3           80           4.4            D

3. Combining conditions with & (and):
       column_name  column_name1 column_name2
row1           60           2.2            B
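row2           30           3.3            C
row3           80           4.4            D

4. Selecting rows using a query string:
       column_name  column_name1 column_name2
row1           60           2.2            B
row3           80           4.4            D

5. Selecting rows with specific values in a column:
       column_name  column_name1 column_name2
row2           30           3.3            C
row3           80           4.4            D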
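The cell above mentions `|` (or) and `~` (not) in a comment but only demonstrates `&`. Here is a short sketch of the other two operators on the same sample data, rebuilt here so the block runs on its own.

```python
import pandas as pd

# Same sample data as in this section
df = pd.DataFrame(
    {
        'column_name': [10, 60, 30, 80, 50],
        'column_name1': [1.1, 2.2, 3.3, 4.4, 5.5],
        'column_name2': ['A', 'B', 'C', 'D', 'E'],
    },
    index=['row0', 'row1', 'row2', 'row3', 'row4'],
)

# | (or): rows where either condition holds
either = df[(df['column_name'] < 20) | (df['column_name1'] > 5)]
print(either.index.tolist())  # ['row0', 'row4']

# ~ (not): rows where the condition does not hold
not_large = df[~(df['column_name'] > 50)]
print(not_large.index.tolist())  # ['row0', 'row2', 'row4']

# The same "or" filter written as a query string
same = df.query('column_name < 20 or column_name1 > 5')
print(same.equals(either))  # True
```

The parentheses around each comparison are required because `&`, `|`, and `~` bind more tightly than `<` and `>` in Python.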

## Accessing Rows and Columns

In [59]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 22, 35, 28],
    'Score': [85, 92, 78, 95, 89],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])

df

         Name  Age  Score           City
row1    Alice   25     85       New York
row2      Bob   30     92  San Francisco
row3  Charlie   22     78    Los Angeles
row4    David   35     95        Chicago
row5     Emma   28     89         Boston


In [62]:
# 1. Using loc with Row and Column Labels: Select specific rows and columns by providing labels.

# Selecting specific rows and columns by labels
selected_data = df.loc[['row2', 'row3'], ['Name', 'Score']]
print("\n1. Using loc with Row and Column Labels:\n", selected_data)

# 2. Using iloc with Integer Positions: Select specific rows and columns by providing integer positions.

# Selecting specific rows and columns by integer positions
selected_data = df.iloc[[0, 1], [0, 2]]  # Here, selecting rows 0 and 1, and columns 0 (Name) and 2 (Score)
print("\n2. Using iloc with Integer Positions:\n", selected_data)

# 3. Selecting a Range of Rows and Columns: Using slices with loc and iloc to select ranges of rows and columns.

# Using loc to select a range of rows and columns
selected_data_loc = df.loc['row2':'row4', 'Name':'City']
print("\n3. Selecting a Range of Rows and Columns using loc:\n", selected_data_loc)

# Using iloc to select a range of rows and columns
#                            r     c
selected_data_iloc = df.iloc[1:4, 0:3]  # Selecting from row index 1 to 3 and column index 0 to 2
print("\n3. Selecting a Range of Rows and Columns using iloc:\n", selected_data_iloc)



1. Using loc with Row and Column Labels:
          Name  Score
row2      Bob     92
row3  Charlie     78

2. Using iloc with Integer Positions:
        Name  Score
row1  Alice     85
row2    Bob     92

3. Selecting a Range of Rows and Columns using loc:
          Name  Age  Score           City
row2      Bob   30     92  San Francisco
row3  Charlie   22     78    Los Angeles
row4    David   35     95        Chicago

3. Selecting a Range of Rows and Columns using iloc:
          Name  Age  Score
row2      Bob   30     92
row3  Charlie   22     78
row4    David   35     95
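`loc` also accepts a boolean condition in the row position, which combines the conditional selection shown earlier with column selection in one step. The sketch below reuses part of this section's sample data; the thresholds (26 and 80) are arbitrary values picked for illustration.

```python
import pandas as pd

# A subset of this section's sample data
df = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
        'Age': [25, 30, 22, 35, 28],
        'Score': [85, 92, 78, 95, 89],
    },
    index=['row1', 'row2', 'row3', 'row4', 'row5'],
)

# Rows chosen by a condition, columns chosen by label
older = df.loc[df['Age'] > 26, ['Name', 'Score']]
print(older)

# A single cell: one row label and one column label
print(df.loc['row4', 'Score'])  # 95

# loc also supports assignment to the selected cells
df.loc[df['Score'] < 80, 'Score'] = 80
print(df['Score'].tolist())  # [85, 92, 80, 95, 89]
```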
