
**PANDAS**

# Installation

To install Pandas, run 'pip install pandas' in a terminal. In a notebook cell, prefix the command with '!'.

In [3]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


# Basics of Pandas
Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is one of the most popular libraries in Python for data manipulation and analysis.

# Key Features of Pandas


1.   Data Structures: Pandas introduces two main data structures:
     *   Series: A one-dimensional array-like object, similar to a list or a column in a table, with labels (index) for each data point.
     *   DataFrame: A two-dimensional, tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or a SQL table, making it ideal for handling datasets.
2.   Data Manipulation: It provides powerful tools to perform data manipulation tasks, such as merging, reshaping, selecting, and cleaning data.
3.   Handling Missing Data: Pandas has built-in functions to handle missing or incomplete data.
4.   Easy Data Import/Export: It supports reading and writing data from various file formats, including CSV, Excel, SQL databases, and JSON.
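
The data manipulation and missing-data features listed above can be sketched briefly. This is a minimal illustration, not a complete tour: the table contents and the column names ("name", "score", "team") are made up for the example.

```python
import pandas as pd

# Hypothetical data: None marks a missing measurement.
scores = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "score": [90.0, None, 70.0],
})

# Handling missing data: count the gaps, then either drop or fill them.
missing_count = scores["score"].isna().sum()
dropped = scores.dropna()
filled = scores.fillna({"score": scores["score"].mean()})

# Data manipulation: merge with a second (also hypothetical) table on a shared column.
teams = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "team": ["red", "blue", "red"],
})
merged = filled.merge(teams, on="name")

print(missing_count)
print(merged)
```

Whether to drop or fill missing values depends on the dataset; filling with the column mean is only one possible choice.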




# What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [4]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


# Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [4]:
print(myvar[0])

1


# Create Labels
With the index argument, you can name your own labels.

In [5]:
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


When you have created labels, you can access an item by referring to the label.



In [6]:
print(myvar["y"])


7
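
Once a Series has custom labels, it can help to be explicit about whether a lookup is by label or by position. The sketch below uses the .loc and .iloc accessors for that purpose; they are not used in the cells above, and the variable names are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1, 7, 2], index=["x", "y", "z"])

# .loc looks a value up by its label.
by_label = s.loc["y"]

# .iloc looks a value up by its integer position (0-based).
by_position = s.iloc[1]

# Passing a list of labels returns a new, smaller Series.
several = s.loc[["x", "z"]]

print(by_label, by_position)
print(several)
```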


# Pandas DataFrames


## What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

Example

In [7]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45
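
To make the "rows and columns" picture concrete, here is a small sketch of pulling a single column and a single row out of the same DataFrame. It assumes the default integer row labels shown in the output above; the variable names are illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# Selecting one column returns a Series.
calories = df["calories"]

# .loc[0] returns the row labeled 0, also as a Series (indexed by column name).
first_row = df.loc[0]

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

print(calories)
print(first_row)
print(n_rows, n_cols)
```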


# Reading and Loading CSV Files with Pandas
## Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values).

CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'students.csv', which we create below.





### Creating a CSV file using Pandas


In [8]:
# Creating a simple dataset
data = {
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'Noah', 'Ava', 'William', 'Sophia', 'James', 'Mia'],
    'Age': [15, 14, 16, 15, 14, 16, 15, 14, 16, 15],
    'Grade': [10, 9, 11, 10, 9, 11, 10, 9, 11, 10],
    'Math Score': [85, 92, 76, 89, 72, 95, 80, 88, 78, 91],
    'Science Score': [78, 88, 83, 90, 65, 93, 85, 87, 82, 92],
    'Sports Participation': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('students.csv', index=False)

print("CSV file created successfully!")

CSV file created successfully!


## Load the CSV into a DataFrame:

In [10]:
df = pd.read_csv('students.csv')

print(df.to_string())

      Name  Age  Grade  Math Score  Science Score Sports Participation
0     John   15     10          85             78                  Yes
1     Emma   14      9          92             88                   No
2     Liam   16     11          76             83                  Yes
3   Olivia   15     10          89             90                  Yes
4     Noah   14      9          72             65                   No
5      Ava   16     11          95             93                  Yes
6  William   15     10          80             85                   No
7   Sophia   14      9          88             87                  Yes
8    James   16     11          78             82                   No
9      Mia   15     10          91             92                  Yes
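
The index=False argument passed to to_csv() earlier matters when the file is read back. The sketch below compares both settings on a two-row sample; it writes to in-memory text rather than to disk (an assumption made to keep the example self-contained).

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["John", "Emma"], "Age": [15, 14]})

# With no path given, to_csv() returns the CSV text as a string.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Read both versions back from in-memory text.
back_with = pd.read_csv(io.StringIO(csv_with_index))
back_without = pd.read_csv(io.StringIO(csv_without_index))

print(list(back_with.columns))     # ['Unnamed: 0', 'Name', 'Age']
print(list(back_without.columns))  # ['Name', 'Age']
```

When the index is written, it comes back as an extra unnamed column, which is why the examples here save with index=False.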


### Tip: use to_string() to print the entire DataFrame.


> Note: If you print a large DataFrame with many rows without to_string(), Pandas displays only the first 5 rows and the last 5 rows. The following cell builds a 100-row dataset to experiment with:





In [11]:
import random

# Generate a dataset of 100 students
data = {
    'Name': [f'Student {i+1}' for i in range(100)],
    'Age': [random.randint(14, 18) for _ in range(100)],
    'Grade': [random.randint(9, 12) for _ in range(100)],
    'Math Score': [random.randint(50, 100) for _ in range(100)],
    'Science Score': [random.randint(50, 100) for _ in range(100)],
    'Sports Participation': [random.choice(['Yes', 'No']) for _ in range(100)]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('students_extended.csv', index=False)

# Display the first few rows
print(df.head())

        Name  Age  Grade  Math Score  Science Score Sports Participation
0  Student 1   15     10          75             67                   No
1  Student 2   16     12          67             56                  Yes
2  Student 3   16     10          67             65                  Yes
3  Student 4   16     10          77             76                  Yes
4  Student 5   18     11          95             77                   No


### max_rows
The number of rows displayed before Pandas truncates the output is defined in the Pandas option settings.

You can check the current maximum with pd.options.display.max_rows.

In [12]:
print(pd.options.display.max_rows)

60
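
To see the effect of this option, the following sketch prints a 100-row DataFrame under two settings using pd.option_context, which changes options only inside the with block. The single column "n" is a stand-in for a real dataset, and the values 60 and 10 are set explicitly so the example does not depend on local configuration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

# With max_rows at 60, a 100-row frame is shortened to its first and last rows.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = str(df)

# Setting max_rows to None removes the limit inside this block only.
with pd.option_context("display.max_rows", None):
    full = str(df)

print(len(truncated.splitlines()))
print(len(full.splitlines()))
```

Outside the with blocks, the option returns to its previous value, so other cells in the notebook are unaffected.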


# Viewing the Data
One of the most used methods for getting a quick overview of a DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top. If no number is given, it returns the first 5 rows.



In [13]:
df = pd.read_csv('students_extended.csv')

print(df.head(10))

         Name  Age  Grade  Math Score  Science Score Sports Participation
0   Student 1   15     10          75             67                   No
1   Student 2   16     12          67             56                  Yes
2   Student 3   16     10          67             65                  Yes
3   Student 4   16     10          77             76                  Yes
4   Student 5   18     11          95             77                   No
5   Student 6   17     10          59             89                   No
6   Student 7   17     10          68             52                   No
7   Student 8   15     12          84             59                  Yes
8   Student 9   15     10          83             81                   No
9  Student 10   15     12          71             94                  Yes
