# Pandas DataFrame

A **Pandas DataFrame** is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Data is aligned in a tabular fashion in rows and columns, so a DataFrame has three principal components: the data, the rows, and the columns.

![Pandas DataFrames](https://github.com/leone-nyaga/Data_Analysis/blob/main/images/pandas-illustration.png)

## Creating a Pandas DataFrame

A Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.
A DataFrame can also be created from Python objects such as lists, a dictionary, or a list of dictionaries.
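
The list-of-dictionaries and CSV routes mentioned above are not shown elsewhere in this section, so here is a minimal sketch of both. The records and the CSV text are made-up sample data, and the CSV is read from an in-memory string (via io.StringIO) rather than a real file so the example runs anywhere.

```python
import io

import pandas as pd

# A list of dictionaries: each dict becomes one row, keys become column names.
records = [
    {'Name': 'Tom', 'Age': 20},
    {'Name': 'nick', 'Age': 21},
    {'Name': 'krish'},  # the missing 'Age' value is filled with NaN
]
df_records = pd.DataFrame(records)
print(df_records)

# CSV text held in memory stands in for a file on disk (hypothetical data).
csv_text = "Name,Age\nTom,20\nnick,21\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

With a real file, pd.read_csv accepts a file path in place of the StringIO object.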

+ [Creating a DataFrame using a list](https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/): A DataFrame can be created from a single list or from a list of lists.

In [1]:
import pandas as pd
 
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 
            'portal', 'for', 'Geeks']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)


        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
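
The cell above covers the single-list case. For a list of lists, a possible sketch looks like this: each inner list becomes one row, and the column names are passed explicitly through the columns argument. The names and ages are illustrative sample values.

```python
import pandas as pd

# Each inner list is one row of the resulting DataFrame.
rows = [['Tom', 20], ['nick', 21], ['krish', 19]]

# Without `columns`, the columns would be labeled 0 and 1.
df_rows = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df_rows)
```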


+ [Creating DataFrame from dict of ndarray/lists](https://www.geeksforgeeks.org/python-create-a-pandas-dataframe-from-a-dict-of-equal-length-lists/): To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length.
If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

In [2]:
# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays / lists
# using the default index.
 
import pandas as pd
 
# Initialise the data as a dict of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
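
The index rule described above can be checked with a short sketch. The labels 'a' to 'd' are arbitrary choices for illustration. The last part deliberately passes an index of the wrong length to show that pandas rejects it.

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# No index passed: the index defaults to range(4), i.e. 0, 1, 2, 3.
df_default = pd.DataFrame(data)

# Index passed: it must have exactly one label per row.
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df_labeled)

# An index of the wrong length is rejected with a ValueError.
mismatch_raised = False
try:
    pd.DataFrame(data, index=['a', 'b'])
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```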


# Dealing with Rows and Columns in Pandas DataFrame

Because a DataFrame arranges data in rows and columns, we can perform basic operations on rows and columns such as selecting, deleting, adding, and renaming.

+ [Column Selection](https://www.geeksforgeeks.org/how-to-select-multiple-columns-in-a-pandas-dataframe/): To select a column in a Pandas DataFrame, we access it by its column name. Passing a list of names selects several columns at once.

In [1]:
# Import pandas package
import pandas as pd
 
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
 
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
 
# select two columns
print(df[['Name', 'Qualification']])

     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
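
Adding, renaming, and deleting are listed at the start of this section but not demonstrated, so the following is a minimal sketch of each. The Height values and the HeightFt name are made up for illustration.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32]}
df = pd.DataFrame(data)

# Adding: assign a list with one value per row to a new column name.
df['Height'] = [5.1, 6.2, 5.1, 5.2]  # hypothetical values

# Renaming: map old column names to new ones; returns a new DataFrame.
df = df.rename(columns={'Height': 'HeightFt'})

# Deleting a column: drop() also returns a new DataFrame.
df = df.drop(columns=['Age'])

# Deleting a row: pass its index label instead.
df_without_first = df.drop(index=0)

print(df)
print(df_without_first)
```

Both rename() and drop() leave the original object untouched unless the result is assigned back, which is why the sketch reassigns df.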



+ [Row Selection](https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/): Pandas provides dedicated indexers for retrieving rows from a DataFrame.
[DataFrame.loc[]](https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/) retrieves rows by their index label. Rows can also be selected by integer position with the [iloc[]](https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/) indexer.

P.S. loc selects by label (location); iloc selects by integer position (integer location).

In [4]:
import pandas as pd

# Sample DataFrame with Name, Age, and City
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}

df = pd.DataFrame(data)
print(df.loc[2])
print("\n")

print(df.iloc[0])
print("\n")

print(df.iloc[0:3])  # Selects rows at positions 0, 1, and 2 (slicing)
print("\n")

# Set custom index labels
df_custom = df.set_index('Name')
print(df_custom)
print("\n")
# Access using the custom index
print(df_custom.loc['Alice'])  # Accessing Alice by her name




Name    Charlie
Age          35
City    Chicago
Name: 2, dtype: object


Name       Alice
Age           25
City    New York
Name: 0, dtype: object


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


         Age         City
Name                     
Alice     25     New York
Bob       30  Los Angeles
Charlie   35      Chicago
David     40      Houston
Eva       22      Phoenix


Age           25
City    New York
Name: Alice, dtype: object
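
With the default index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a deliberately non-default index (10, 20, 30, chosen only for illustration) to separate the two. It also shows that a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]},
                  index=[10, 20, 30])

by_label = df.loc[10]     # the row whose index label is 10
by_position = df.iloc[0]  # the first row, whatever its label is

# Label slices include both endpoints: labels 10 and 20.
label_slice = df.loc[10:20]

# Position slices exclude the end: position 0 only.
position_slice = df.iloc[0:1]

print(label_slice)
print(position_slice)
```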


# Indexing and Selecting Data in Pandas

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns.
Indexing is also known as **Subset Selection**.

This is best explained with an example.

In [3]:
"""SAMPLE DATA"""
import pandas as pd

# Creating a sample DataFrame
info = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 40, 22],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 70000, 80000, 40000]
}

df1 = pd.DataFrame(info)

# Display the DataFrame
print(df1)

      Name  Age         City  Salary
0    Alice   24     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
3    David   40      Houston   80000
4      Eva   22      Phoenix   40000


1. Selecting Specific Columns

We can select only the **Name** and **Salary** columns.

In [4]:
subset_columns = df1[['Name', 'Salary']]
print(subset_columns)

      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   70000
3    David   80000
4      Eva   40000


2. Selecting the first 3 rows

We use the head(n) method, which returns the first n rows.

In [4]:
first_three = df1.head(3)
print(first_three)

      Name  Age         City  Salary
0    Alice   24     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
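
As a small aside, head(n) is one of several ways to take rows by position. The sketch below, which rebuilds a trimmed copy of the sample data so it runs on its own, suggests that head(3) matches an iloc slice and that tail(n) is its counterpart for the last rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 40, 22],
})

first_three = df1.head(3)        # first 3 rows
same_by_position = df1.iloc[:3]  # positions 0, 1, 2
last_two = df1.tail(2)           # last 2 rows

print(first_three.equals(same_by_position))
print(last_two)
```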


3. Selecting rows where Age is between 25 and 35

This is a range condition: two comparisons combined with &, each wrapped in parentheses.

In [7]:
age_range = df1[(df1['Age'] >= 25) & (df1['Age'] <= 35)]
print(age_range)

      Name  Age         City  Salary
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
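
The same range filter can be written more compactly. The sketch below shows two alternatives, Series.between() (inclusive on both ends by default) and DataFrame.query(), on a trimmed copy of the sample data. They are offered as options, not as replacements for the boolean mask above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 40, 22],
})

# between() builds the same boolean mask as the two comparisons joined by &.
in_range = df1[df1['Age'].between(25, 35)]

# query() takes the condition as a string that refers to column names.
in_range_query = df1.query('25 <= Age <= 35')

print(in_range)
```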


4. Finding the highest Salary

We find the row with the highest salary by comparing each Salary value against the column's max().

In [9]:
max_salary = df1[df1['Salary'] == df1['Salary'].max()]
print(max_salary)

    Name  Age     City  Salary
3  David   40  Houston   80000
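
Two related tools are worth a brief sketch: idxmax() returns the index label of the first maximum, and nlargest() returns the top n rows by a column. One difference to keep in mind is that the boolean mask above keeps every row tied for the maximum, whereas idxmax() points at only the first one. The data is a trimmed copy of the sample DataFrame.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Salary': [50000, 60000, 70000, 80000, 40000],
})

# idxmax() gives the label of the highest salary; loc fetches that row.
top_row = df1.loc[df1['Salary'].idxmax()]

# nlargest() returns the n rows with the largest values, highest first.
top_two = df1.nlargest(2, 'Salary')

print(top_row)
print(top_two)
```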


5. Selecting rows where Name starts with 'A' and showing only Name, City, and Salary:

In [17]:
starts_with_a = df1.loc[df1['Name'].str.startswith('A'), ['Name', 'City', 'Salary']]
print(starts_with_a)

    Name      City  Salary
0  Alice  New York   50000
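
To tie the examples back to the three kinds of subset selection described at the start of this section, here is a minimal sketch that expresses each one with loc. The Age > 30 condition is an arbitrary choice for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 40, 22],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 70000, 80000, 40000],
})

# All of the rows, some of the columns (':' means every row).
all_rows_some_cols = df1.loc[:, ['Name', 'City']]

# Some of the rows, all of the columns.
some_rows_all_cols = df1.loc[df1['Age'] > 30]

# Some of the rows and some of the columns.
some_of_each = df1.loc[df1['Age'] > 30, ['Name', 'Salary']]

print(some_of_each)
```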
