# Introduction to Pandas, NumPy, and Matplotlib
This notebook introduces the basics of Pandas, NumPy, and Matplotlib, three essential Python libraries for data manipulation, analysis, and visualization. Pandas provides structures such as Series and DataFrame for handling structured data, NumPy offers N-dimensional arrays and mathematical functions, and Matplotlib creates static, animated, and interactive plots.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Pandas

In [None]:
# Importing data
data = pd.read_csv('BodyPerformance.csv')
df = data.copy()

In [None]:
print("First few rows of the DataFrame:")
df.head()

In [None]:
print("Last few rows of the DataFrame:")
df.tail()

In [None]:
print(f"Dimension of the DataFrame: {df.shape}")

In [None]:
print(f"Data types of the columns: {df.dtypes}")

In [None]:
print(f"Index of the DataFrame: {df.index}")

In [None]:
# Head of the age column
df['age'].head()

In [None]:
# Minimum age
df['age'].min()

In [None]:
# Maximum age
df['age'].max()

In [None]:
# Mean age
df['age'].mean()

Correlation is a statistical measure that describes the extent to which two variables move in relation to each other. It is quantified by the correlation coefficient, which ranges from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear relationship. It's important to note that correlation does not imply causation, and it only measures linear relationships.
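
As a minimal sketch of these three cases, the following uses a small made-up DataFrame (the column names and values are illustrative and are not taken from `BodyPerformance.csv`). It shows a perfect positive relationship, a perfect negative relationship, and a purely nonlinear relationship whose coefficient is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from BodyPerformance.csv
toy = pd.DataFrame({"x": [-2.0, -1.0, 0.0, 1.0, 2.0]})
toy["double_x"] = 2 * toy["x"]      # perfect positive linear relationship
toy["minus_x"] = -toy["x"]          # perfect negative linear relationship
toy["x_squared"] = toy["x"] ** 2    # fully determined by x, but not linearly

# Pearson correlation of every column with 'x'
corr_with_x = toy.corr()["x"]
print(corr_with_x)
```

The `x_squared` column depends entirely on `x`, yet its coefficient is zero, which illustrates that the coefficient only captures linear relationships.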

In [None]:
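# Restricting the calculation to numeric columns; text columns such as 'gender' cannot be correlated
df.corr(numeric_only=True)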

The `.describe()` method in Pandas provides a summary of statistical measures for numerical columns in a DataFrame. It includes metrics such as count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. This method is useful for quickly understanding the distribution and central tendency of the data.

In [None]:
df.describe()

`NaN` (Not a Number) represents missing or undefined values in a DataFrame. Handling `NaN` values is crucial because they can lead to inaccurate analysis and affect the performance of statistical methods and machine learning algorithms. Common strategies to address `NaN` values include removing them or filling them with appropriate values like mean, median, or mode to ensure data integrity and reliability.
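
The strategies mentioned above can be sketched on a small hypothetical Series (the values are made up for illustration). Which strategy is appropriate depends on the data, so this is an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
ages = pd.Series([20.0, np.nan, 30.0, 40.0, np.nan], name="age")

dropped = ages.dropna()                      # remove the missing entries
filled_mean = ages.fillna(ages.mean())       # mean of 20, 30, 40 is 30
filled_median = ages.fillna(ages.median())   # median of 20, 30, 40 is 30

print(len(dropped))
print(filled_mean.tolist())
```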

In [None]:
df.isnull().sum()

In [None]:
# Filling missing values in the 'age' column with the mean value
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['age'] = df['age'].fillna(df['age'].mean())

# Alternative: filling with the mode (most frequent value) instead
# Note: once the mean fill above has run, no missing values remain, so this line changes nothing
df['age'] = df['age'].fillna(df['age'].mode()[0])

# Checking the number of remaining missing values in the 'age' column
missing_values_count = df['age'].isnull().sum()
print(f"Number of missing values in 'age' column: {missing_values_count}")

The `.map()` method in Pandas is used to transform values in a Series based on a mapping relationship. It can map values using a dictionary, apply a function to each element, or map values from another Series. This method is useful for data transformation and cleaning, enabling efficient and readable code.
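
A minimal sketch of the dictionary and function forms, using a hypothetical Series. One detail worth noticing is that values absent from the dictionary become `NaN`.

```python
import pandas as pd

# Hypothetical toy Series; "X" has no entry in the dictionary below
codes = pd.Series(["M", "F", "F", "X"])

# Dictionary mapping: values missing from the dictionary become NaN
as_labels = codes.map({"M": "male", "F": "female"})

# Function mapping: the function is applied to each element
as_lower = codes.map(str.lower)

print(as_labels.tolist())
print(as_lower.tolist())
```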

In [None]:
# Encoding gender as strings: 'M' becomes '0' and 'F' becomes '1'
df['gender'] = df['gender'].map({'M': '0', 'F': '1'})
df.tail()

The `.apply()` method in Pandas applies a function along an axis of a DataFrame (column by column or row by row) or, when called on a Series, to each element. It is suited to custom transformations and aggregations that have no built-in equivalent; for simple arithmetic, vectorized operations on the column are usually faster.
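
The three usages can be sketched as follows on a hypothetical two-column DataFrame (names and values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series
col_ranges = toy.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives one element at a time
doubled = toy["a"].apply(lambda v: v * 2)

# For simple arithmetic, the vectorized form gives the same result
assert doubled.equals(toy["a"] * 2)
```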

In [None]:
# Converting height from centimetres to metres; note that the column keeps the name 'height_cm'
df['height_cm'] = df['height_cm'].apply(lambda x: x*0.01)
df.head()

In [None]:
# Adding a new column 'BMI' to the DataFrame
# BMI = weight (kg) / height (m) squared; 'height_cm' was converted to metres in the previous cell
df['BMI'] = df['weight_kg'] / (df['height_cm'] ** 2)
df.head()

In [None]:
# Dropping the 'class' column from the DataFrame
# The 'axis=1' parameter specifies that we are dropping a column (not a row)
# The 'inplace=True' parameter ensures that the change is made directly to the original DataFrame
df.drop(['class'], axis=1, inplace=True)
df.head()

In [None]:
# Renaming the 'age' column to 'Age'
# The 'inplace=True' parameter ensures that the change is made directly to the original DataFrame
df.rename(columns={'age':'Age'},inplace=True)
df.head()

In [None]:
# Filtering the DataFrame to include only rows where the 'Age' column is equal to 25
# This creates a new DataFrame 'df_age_25' containing only the rows with Age 25
df_age_25 = df[df['Age']==25]
df_age_25.head()

In [None]:
# Filtering the DataFrame to include only rows where the 'Age' column is equal to 25
# and the 'gender' column is equal to '1' (which represents female after the mapping above)
# This creates a new DataFrame 'df_female_and_age_25' containing only the rows that meet both conditions
df_female_and_age_25 = df[(df['Age']==25) & (df['gender']=='1')]
df_female_and_age_25.head()

The `.loc[]` method in Pandas is used for label-based indexing and selection. It allows you to select rows and columns by their labels or a boolean array. This method is versatile and can be used to slice data, filter rows, and select specific columns. It is particularly useful for accessing a group of rows and columns by labels or conditional statements.

##### Options for Using `.loc[]`

- Selecting Rows and Columns by Labels
You can select specific rows and columns by their labels, which is useful for accessing data based on column names and row indices.

- Selecting All Rows or Columns
Using `:` allows you to select all rows or all columns. This is helpful when you want to apply operations to entire rows or columns without specifying each one individually.

- Conditional Selection
You can filter rows based on conditions. For example, you can select rows where a column's value meets a certain condition, such as being greater than a specific number.

- Selecting with Boolean Arrays
You can pass a boolean array to `.loc[]` to select rows that meet certain criteria. This is useful for more complex filtering based on multiple conditions.

The `.loc[]` method is a powerful tool for label-based indexing and selection in Pandas, supporting a wide range of data manipulation and analysis tasks.
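
The options listed above can be sketched on a small hypothetical DataFrame with string row labels (the labels and values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"Age": [25, 31, 42], "weight_kg": [60.5, 72.0, 80.3]},
    index=["a", "b", "c"],
)

single = toy.loc["b", "Age"]                  # one row label, one column label
all_rows = toy.loc[:, "weight_kg"]            # all rows, one column
label_slice = toy.loc["a":"b"]                # label slices include both endpoints
over_30 = toy.loc[toy["Age"] > 30, ["Age"]]   # boolean mask plus a column list

print(single)
print(over_30)
```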

In [None]:
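# Using .loc[] to select rows with labels 0 through 100 and columns from 'Age' through 'weight_kg'
# Label slices include both endpoints, so this returns 101 rows when the default integer index is used
# This creates a new DataFrame 'subset_df' containing the specified rows and columns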
subset_df = df.loc[0:100, 'Age':'weight_kg']
subset_df.head()

The `.iloc[]` method in Pandas is used for integer-based indexing and selection. It allows you to access rows and columns by their integer positions, making it useful for numerical indexing.

##### Options for Using `.iloc[]`

- Selecting Rows and Columns by Integer Positions
You can select specific rows and columns by their integer positions, which is useful for accessing data based on numerical indices.

- Selecting All Rows or Columns
Using `:` allows you to select all rows or all columns. This is helpful when you want to apply operations to entire rows or columns without specifying each one individually.

- Slicing
You can use slicing to select a range of rows or columns. For example, you can select rows from index 0 to 10 and columns from index 1 to 3.

- Conditional Selection
`.iloc[]` accepts a boolean array or list whose length matches the axis, but not a boolean Series. To filter by a condition, convert the mask first, for example with `.to_numpy()`, or use `.loc[]` instead.

The `.iloc[]` method is a powerful tool for integer-based indexing and selection in Pandas, supporting a wide range of data manipulation and analysis tasks.
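
A minimal sketch of the same options by position, using a hypothetical DataFrame (labels and values are made up for illustration). The last lines show one way to combine `.iloc[]` with a condition by converting the mask to a NumPy array.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"Age": [25, 31, 42], "weight_kg": [60.5, 72.0, 80.3]},
    index=["a", "b", "c"],
)

first_cell = toy.iloc[0, 0]       # row position 0, column position 0
last_row = toy.iloc[-1]           # negative positions count from the end
pos_slice = toy.iloc[0:2, 0:1]    # position slices exclude the stop position

# Conditional selection by converting the boolean Series to a NumPy array
mask = (toy["Age"] > 30).to_numpy()
over_30 = toy.iloc[mask]

print(pos_slice)
print(over_30)
```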

In [None]:
# Using .iloc[] to select row positions 0 to 99 and column positions 0 to 4
# Position slices exclude the stop value, so 0:100 returns 100 rows and 0:5 returns 5 columns
# This creates a new DataFrame 'subset_df' containing the specified rows and columns
subset_df = df.iloc[0:100, 0:5]
subset_df.head()

## NumPy

In [None]:
# Creating a NumPy array with the specified elements
# The input mixes a float with integers, so NumPy stores every element as float64
array_example = np.array([3.2, 4, 6, 5])
array_example

In [None]:
# Creating a NumPy array with integer elements
# The 'dtype="str"' parameter specifies that the elements should be stored as strings
string_array = np.array([1, 2, 3, 4], dtype="str")
string_array

In [None]:
float_array = np.array([3,6,2,3], dtype="float32")
float_array

In [None]:
array_example = np.array([range(i, i+3) for i in [2, 4, 6]])
array_example

In [None]:
np.zeros(10, dtype="int")

In [None]:
np.zeros((5,6), dtype="float")

In [None]:
np.ones((3,5), dtype="float")

In [None]:
np.full((3,5), np.pi)

In [None]:
np.arange(0, 20, 2)

In [None]:
np.linspace(0, 1, 5)

Setting a seed in random number generation ensures reproducibility of results. When you set a seed, the sequence of random numbers generated will be the same each time the code is run. This is particularly useful for debugging and sharing code, as it allows others to replicate the exact sequence of random numbers. In NumPy, you can set the seed using `np.random.seed(value)`.
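
A minimal sketch of this behaviour: resetting the same seed restarts the same sequence. The seed value 42 is arbitrary. The second half shows NumPy's `np.random.default_rng`, which creates a separate generator object that is seeded at construction; it is included here as an alternative, not as something the rest of this notebook relies on.

```python
import numpy as np

# Legacy global seeding, as used in this notebook
np.random.seed(42)
first = np.random.random(3)

np.random.seed(42)                 # resetting the same seed restarts the sequence
second = np.random.random(3)

assert np.array_equal(first, second)

# Alternative: independent Generator objects seeded at construction
gen_first = np.random.default_rng(42).random(3)
gen_second = np.random.default_rng(42).random(3)

assert np.array_equal(gen_first, gen_second)
```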

In [None]:
# Setting the seed for NumPy's random number generator to ensure reproducibility
# This means that the random numbers generated will be the same each time the code is run
np.random.seed(0)

In [None]:
# Generating a 3x3 array of random floating-point numbers between 0 and 1
# The np.random.random() function creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1)
np.random.random((3, 3))

In [None]:
# Generating an array of random numbers from a normal (Gaussian) distribution
# The np.random.normal() function takes three parameters:
# - The mean (0 in this case)
# - The standard deviation (1 in this case)
# - The shape of the output array ((3, 3) in this case)
np.random.normal(0, 1, (3, 3))

In [None]:
# Generating an array of random integers between 0 and 9 (inclusive)
# The np.random.randint() function takes three parameters:
# - The lower bound of the range (0 in this case, inclusive)
# - The upper bound of the range (10 in this case, exclusive)
# - The shape of the output array ((2, 3, 4) in this case)
x = np.random.randint(0, 10, size=(2, 3, 4))

In [None]:
# The number of dimensions of the array
print(f"x ndim: {x.ndim}")

# The shape of the array (dimensions)
print(f"x shape: {x.shape}")

# The total number of elements in the array
print(f"x size: {x.size}")

# The data type of the elements in the array
print(f"Data type: {x.dtype}")

# The size (in bytes) of each element in the array
print(f"Item size: {x.itemsize} bytes")

# The total number of bytes consumed by the elements of the array
print(f"nbytes: {x.nbytes} bytes")

In [None]:
x

In [None]:
x[0,0,0]

In [None]:
x[-2,-2,-2]

In [None]:
x[0,0,0] = 12
x

NumPy slicing allows you to extract a subset of elements from an array. Slicing is performed using the colon (`:`) operator, and it can be applied to both one-dimensional and multi-dimensional arrays.
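
As a short sketch on small illustrative arrays, the following shows the `start:stop:step` form in one and two dimensions. It also shows that a basic slice is a view of the original array, so writing to the slice changes the original.

```python
import numpy as np

a = np.arange(10)            # the integers 0 through 9
first_three = a[:3]          # positions 0, 1, 2
every_other = a[::2]         # every second element
reversed_a = a[::-1]         # a negative step walks backwards

grid = np.arange(12).reshape(3, 4)
block = grid[:2, 1:3]        # rows 0-1, columns 1-2

# Basic slices are views: this write is visible through 'grid' as well
block[0, 0] = 100
print(grid[0, 1])
```

To get an independent array instead of a view, call `.copy()` on the slice.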

In [None]:
x = np.arange(36).reshape(3, 3, 4)

In [None]:
# Slicing the 3D array 'x' to extract a subset
x[:, :2, 1:3]

In [None]:
# Slicing the 3D array 'x' using start, stop, and step parameters, similar to range()
# The slicing parameters are:
# - First dimension: start at index 0, stop at index 3, step by 2 (every second element)
# - Second dimension: start at index 0, stop at index 3, step by 1 (every element)
# - Third dimension: start at index 0, stop at index 4, step by 2 (every second element)
x[0:3:2, 0:3:1, 0:4:2]

The `reshape` function in NumPy is used to change the shape of an existing array without altering its data. This is particularly useful for organizing data into different dimensions for various computational tasks.

In [None]:
x = np.arange(1, 13)
print(f"Shape of x: {x.shape}")
x

In [None]:
# Reshaping the array 'x' to a new shape (3, 4)
# The reshape() function changes the shape of the array without altering its data
# The new shape must be compatible with the original number of elements in 'x'
x = x.reshape((3, 4))
x

In [None]:
# Reshaping the array 'x' to a new shape (1, -1)
# The reshape() function changes the shape of the array without altering its data
# The new shape (1, -1) means:
# - 1 row
# - The number of columns is automatically calculated based on the total number of elements
x = x.reshape(1, -1)
print(f"Shape of x after unrolling: {x.shape}")
x

In [None]:
# Creating two 1D NumPy arrays 'x' and 'y'
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

# Concatenating the two arrays 'x' and 'y' along the first axis (default)
# The np.concatenate() function combines the arrays into a single array
np.concatenate((x, y))

In [None]:
# Creating a 2D NumPy array 'x' with shape (2, 3)
x = np.array([[1, 2, 3],
              [4, 5, 6]])

# Concatenating the array 'x' with itself along the first axis (default)
# The np.concatenate() function combines the arrays into a single array
# Since no axis is specified, it defaults to axis 0 (rows)
np.concatenate((x, x))

In [None]:
# Concatenating the 2D array 'x' with itself along the second axis (columns)
# The np.concatenate() function combines the arrays into a single array
# The parameter axis=1 specifies that the concatenation should be done along the columns
np.concatenate((x, x), axis=1)

In [None]:
# Creating a 1D NumPy array 'x' with specified elements
x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Splitting the array 'x' into four sub-arrays at the specified indices [3, 4, 5]
# The np.split() function splits the array into sub-arrays at the given indices
x1, x2, x3, x4 = np.split(x, [3, 4, 5])

# Displaying the resulting sub-arrays
print(x1, x2, x3, x4)

NumPy provides a wide range of arithmetic operations that can be performed on arrays. These operations are element-wise, meaning they are applied to each element of the array individually. This makes NumPy highly efficient for numerical computations.

In [None]:
# Creating a 1D NumPy array 'x' with specified elements
x = np.array([1, 5, 8, 9])

# Displaying the original array
print("x =", x)

# Performing arithmetic operations on the array 'x'
print("x + 5 =", x + 5)  # Adding 5 to each element
print("x - 5 =", x - 5)  # Subtracting 5 from each element
print("x * 2 =", x * 2)  # Multiplying each element by 2
print("x / 2 =", x / 2)  # Dividing each element by 2
print("x // 2 =", x // 2)  # Floor division of each element by 2
print("-x =", -x)  # Negating each element
print("x ** 2 =", x ** 2)  # Squaring each element
print("x % 2 =", x % 2)  # Modulus operation on each element

# Performing trigonometric operations on the array 'x'
print("sin(x) =", np.sin(x))  # Sine of each element
print("cos(x) =", np.cos(x))  # Cosine of each element
print("tan(x) =", np.tan(x))  # Tangent of each element

# Performing logarithmic operations on the array 'x'
print("ln(x) =", np.log(x))  # Natural logarithm of each element
print("log2(x) =", np.log2(x))  # Base-2 logarithm of each element
print("log10(x) =", np.log10(x))  # Base-10 logarithm of each element

# Performing comparison operations on the array 'x'
print("x < 3 =", x < 3)  # Less than comparison
print("x > 3 =", x > 3)  # Greater than comparison
print("x <= 3 =", x <= 3)  # Less than or equal to comparison
print("x >= 3 =", x >= 3)  # Greater than or equal to comparison
print("x != 3 =", x != 3)  # Not equal to comparison
print("x == 3 =", x == 3)  # Equal to comparison

# Performing a combined operation on the array 'x'
print("(2 * x) == (2 ** x) =", (2 * x) == (2 ** x))  # Comparing 2*x with 2 raised to the power of x

In [None]:
# Exponentiation binds more tightly than unary minus, so this computes -((0.5*x + 1) ** 2)
-(0.5*x+1) ** 2

In [None]:
# Creating a 1D NumPy array 'x' with some NaN values
x = np.array([1, 2, np.nan, 4, 5])

# Calculating the sum of the array 'x'
# np.sum() returns NaN if there are any NaN values in the array
print("np.sum       =", np.sum(x))

# Calculating the sum of the array 'x' while ignoring NaN values
# np.nansum() treats NaN values as zero
print("np.nansum    =", np.nansum(x))

# Calculating the mean of the array 'x'
# np.mean() returns NaN if there are any NaN values in the array
print("np.mean      =", np.mean(x))

# Calculating the mean of the array 'x' while ignoring NaN values
# np.nanmean() ignores NaN values in the calculation
print("np.nanmean   =", np.nanmean(x))

# Calculating the standard deviation of the array 'x'
# np.std() returns NaN if there are any NaN values in the array
print("np.std       =", np.std(x))

# Calculating the standard deviation of the array 'x' while ignoring NaN values
# np.nanstd() ignores NaN values in the calculation
print("np.nanstd    =", np.nanstd(x))

# Finding the index of the minimum value in the array 'x'
# np.argmin() returns the index of the first occurrence of NaN if there are any NaN values
print("np.argmin    =", np.argmin(x))

# Finding the index of the minimum value in the array 'x' while ignoring NaN values
# np.nanargmin() ignores NaN values in the calculation
print("np.nanargmin =", np.nanargmin(x))

NumPy provides powerful and efficient functions for sorting arrays. Sorting is a common operation in data analysis and scientific computing, and NumPy's sorting functions are optimized for performance.

In [None]:
# Basic Sorting with np.sort()
arr = np.array([3, 1, 2])
np.sort(arr)

In [None]:
# Sorting Indices with np.argsort()
arr = np.array([3, 1, 2])
np.argsort(arr)

In [None]:
# Sorting a 3D Array
arr = np.array([[[3, 2, 1], [6, 5, 4]], [[9, 8, 7], [12, 11, 10]]])
np.sort(arr, axis=2)

Broadcasting is a powerful feature in NumPy that allows arithmetic operations on arrays of different shapes. It enables NumPy to perform element-wise operations without explicitly replicating the smaller array, leading to more efficient and readable code.

#### Rules of Broadcasting
#### Rule 1: Dimension Padding
- If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side. This ensures that both arrays have the same number of dimensions, facilitating element-wise operations.

#### Rule 2: Dimension Stretching
- If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape. This virtual stretching allows the smaller array to be broadcast across the larger array without actual data replication.

#### Rule 3: Dimension Mismatch Error
- If in any dimension the sizes disagree and neither is equal to 1, an error is raised. This ensures that only compatible shapes can be broadcast together, preventing ambiguous operations.

Broadcasting in NumPy simplifies arithmetic operations on arrays of different shapes by virtually expanding the smaller array to match the shape of the larger array. The three key rules of broadcasting are:
1. **Dimension Padding**: Pad the shape of the smaller array with ones on the left side.
2. **Dimension Stretching**: Stretch the array with shape equal to 1 in any dimension to match the other shape.
3. **Dimension Mismatch Error**: Raise an error if the sizes disagree in any dimension and neither is equal to 1.

Understanding these rules helps in efficiently performing numerical computations and avoiding common pitfalls.
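
The rules can be checked on shapes alone with `np.broadcast_shapes`, as in the sketch below. This assumes NumPy 1.20 or later, where that function is available; the shapes are arbitrary examples.

```python
import numpy as np

# Rules 1 and 2: (3,) is padded to (1, 3), then both shapes stretch to (3, 3)
shape_a = np.broadcast_shapes((3,), (3, 1))

# Rules 1 and 2: (2, 1) is padded to (1, 2, 1), then stretched to (1, 2, 3)
shape_b = np.broadcast_shapes((2, 1), (1, 2, 3))

print(shape_a, shape_b)

# Rule 3: the last dimensions are 3 and 2, and neither is 1, so NumPy raises an error
try:
    np.ones((2, 3)) + np.ones((2, 2))
    raised = False
except ValueError:
    raised = True

print(raised)
```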

In [None]:
# Broadcasting with Different Dimensions
# Creating a 1D array and a 2D array
A = np.array([1, 2, 3])  # Shape (3,)
B = np.array([[10], [20], [30]])  # Shape (3, 1)

# Broadcasting A and B to a common shape
# A is padded to shape (1, 3) and stretched to (3, 3); B is stretched from (3, 1) to (3, 3)
result = A + B
result

In [None]:
# Broadcasting with Higher Dimensions
# Creating a 3D array and a 2D array
C = np.array([[[1, 2, 3], [4, 5, 6]]])  # Shape (1, 2, 3)
D = np.array([[10], [20]])  # Shape (2, 1)

# Broadcasting D to match the shape of C
# D is padded to shape (1, 2, 1) and then stretched to shape (1, 2, 3)
result = C + D
result


In [None]:
# Broadcasting with Scalars
# Creating a 2D array
E = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)

# Broadcasting a scalar to match the shape of E
# The scalar behaves like an array with shape () that is stretched to shape (2, 3)
result = E * 2
result


In [None]:
# Broadcasting with Different but Compatible Shapes
# Creating two 2D arrays whose shapes differ only in the second dimension
F = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
G = np.array([[10], [20]])  # Shape (2, 1)

# Broadcasting G to match the shape of F
# G is stretched to shape (2, 3)
result = F + G
result


In [None]:
# Broadcasting with Multi-Dimensional Arrays
# Creating a 3D array and a 1D array
H = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # Shape (2, 2, 3)
I = np.array([10, 20, 30])  # Shape (3,)

# Broadcasting I to match the shape of H
# I is padded to shape (1, 1, 3) and then stretched to shape (2, 2, 3)
result = H + I
result

The dot product of two vectors is the sum of the products of their corresponding elements. For 2D arrays, `np.dot()` performs matrix multiplication.
- The `np.dot()` function computes the dot product of two arrays. It can handle both vectors and matrices.
- The `@` operator calls `np.matmul()`. For 1D and 2D arrays it gives the same result as `np.dot()`; the two differ for arrays with more than two dimensions, and `@` does not accept scalar operands.
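
A small sketch with made-up values that checks the definition by hand for two vectors and compares `np.dot()` with `@` for two 2D arrays:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# Sum of element-wise products: 1*4 + 2*5 + 3*6 = 32
manual = sum(a * b for a, b in zip(u, v))
assert manual == np.dot(u, v) == u @ v

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# For 2D arrays, both forms perform matrix multiplication
product = A @ B
assert np.array_equal(product, np.dot(A, B))
print(product)
```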

In [None]:
# Dot Product of 1D and 2D Arrays
vector = np.array([1, 2])
matrix = np.array([[3, 4], [5, 6]])
np.dot(vector, matrix)

In [None]:
# Transposing a 2D Array (Matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)

# Transposing the matrix
transposed_matrix = np.transpose(matrix)
print(transposed_matrix)

# Alternatively, you can use the .T attribute to transpose
transposed_matrix_T = matrix.T
print(transposed_matrix_T)

## Matplotlib

In [None]:
# Creating a dictionary with more data
data = {
    'Year': [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023],
    'Sales': [200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'Profit': [20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Expenses': [180, 270, 360, 450, 540, 630, 720, 810, 900]
}

# Converting the dictionary to a DataFrame
df = pd.DataFrame(data)

# Setting the figure size
plt.figure(figsize=(12, 8))

# Line plot for Sales
plt.subplot(2, 2, 1)  # Creating a subplot (2 rows, 2 columns, 1st plot)
df.plot(x='Year', y='Sales', kind='line', ax=plt.gca())  # Plotting line plot for Sales
plt.title('Sales Over Years')  # Setting the title for the subplot
plt.xlabel('Year')  # Setting the x-axis label
plt.ylabel('Sales')  # Setting the y-axis label
plt.grid(True)  # Enabling the grid

# Bar plot for Profit
plt.subplot(2, 2, 2)  # Creating a subplot (2 rows, 2 columns, 2nd plot)
df.plot(x='Year', y='Profit', kind='bar', ax=plt.gca())  # Plotting bar plot for Profit
plt.title('Profit Over Years')  # Setting the title for the subplot
plt.xlabel('Year')  # Setting the x-axis label
plt.ylabel('Profit')  # Setting the y-axis label
plt.grid(True)  # Enabling the grid

# Scatter plot for Expenses
plt.subplot(2, 2, 3)  # Creating a subplot (2 rows, 2 columns, 3rd plot)
df.plot(x='Year', y='Expenses', kind='scatter', ax=plt.gca())  # Plotting scatter plot for Expenses
plt.title('Expenses Over Years')  # Setting the title for the subplot
plt.xlabel('Year')  # Setting the x-axis label
plt.ylabel('Expenses')  # Setting the y-axis label
plt.grid(True)  # Enabling the grid

# Line plot for Sales and Profit
plt.subplot(2, 2, 4)  # Creating a subplot (2 rows, 2 columns, 4th plot)
df.plot(x='Year', y=['Sales', 'Profit'], kind='line', ax=plt.gca())  # Plotting line plot for Sales and Profit
plt.title('Sales and Profit Over Years')  # Setting the title for the subplot
plt.xlabel('Year')  # Setting the x-axis label
plt.ylabel('Amount')  # Setting the y-axis label
plt.grid(True)  # Enabling the grid

# Customizing the overall plot
plt.suptitle('Company Financial Data Over Years')  # Setting the overall title for the figure
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjusting the layout to make room for the suptitle

# Displaying the plots
plt.show()