# NumPy and Pandas

## NumPy

NumPy (Numerical Python) is a powerful Python library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. It serves as the foundation for many other scientific computing libraries in Python. Here are some key aspects of NumPy:

1. Multi-dimensional Arrays: NumPy's main feature is the `ndarray` (n-dimensional array) object, which represents a multi-dimensional grid of elements of the same type. It can be used to store and manipulate data efficiently, such as numbers, images, sound waves, or any other form of numerical data.

2. Homogeneous Data: NumPy arrays contain elements of the same data type, ensuring that all elements have consistent memory layout, which leads to better performance and efficient memory usage.

3. Array Creation: NumPy provides various functions to create arrays, such as `numpy.array()`, `numpy.zeros()`, `numpy.ones()`, `numpy.arange()`, `numpy.linspace()`, and more. These functions allow you to create arrays of different shapes, sizes, and initial values.

4. Array Operations: NumPy provides a wide range of mathematical and logical operations on arrays, including arithmetic operations, trigonometric functions, exponential functions, linear algebra operations, statistical functions, and more. These operations are optimized for performance and can be applied element-wise or along specified axes.

5. Array Indexing and Slicing: NumPy arrays support indexing and slicing operations to access specific elements or sub-arrays. You can use integer indexing, boolean indexing, or even fancy indexing to select elements based on certain conditions or patterns.

6. Broadcasting: NumPy supports broadcasting, a mechanism for performing arithmetic between arrays of different but compatible shapes. Dimensions of size 1 (or missing leading dimensions) are treated as if they were stretched to match the other array, so element-wise operations work without explicit copying.

7. Universal Functions (ufuncs): NumPy provides universal functions that operate element-wise on arrays, performing fast computations on large data sets. These functions are implemented in compiled C code, making them much faster than equivalent Python loops.

8. Array Manipulation: NumPy offers functions to manipulate arrays, such as reshaping, transposing, concatenating, splitting, and stacking arrays. These functions allow you to change the shape and structure of arrays to fit your specific needs.

9. Integration with Python Ecosystem: NumPy seamlessly integrates with other scientific libraries in Python, such as SciPy, Pandas, Matplotlib, and scikit-learn, enabling powerful data analysis, visualization, and machine learning workflows.

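10. Efficient Storage and Computation: NumPy arrays are typically stored in a single contiguous block of memory (views created by slicing or transposing may not be contiguous), allowing efficient storage and computation. NumPy also provides functions to save and load arrays from disk, such as `numpy.save()`, `numpy.load()`, and `numpy.savetxt()`, making it easy to work with large datasets.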

NumPy is widely used in various domains, including scientific computing, data analysis, machine learning, and numerical simulations. Its efficient array operations, broadcasting capabilities, and integration with other libraries make it an essential tool for working with numerical data in Python.
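
Several of the aspects listed above (broadcasting, operations along an axis, boolean indexing, and reshaping) can be sketched on one small array. The array values and variable names here are chosen only for illustration.

```python
import numpy as np

# A 2x3 matrix and a length-3 row vector (illustrative values)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Broadcasting: the row vector is added to every row of the matrix
shifted = matrix + row            # [[11, 22, 33], [14, 25, 36]]

# Reductions along an axis: axis=0 collapses the rows, axis=1 collapses the columns
col_sums = matrix.sum(axis=0)     # one sum per column: 5, 7, 9
row_means = matrix.mean(axis=1)   # one mean per row: 2.0, 5.0

# Boolean indexing: keep only the elements that satisfy a condition
evens = matrix[matrix % 2 == 0]   # 2, 4, 6

# Reshaping changes the shape without changing the data
flat = matrix.reshape(-1)         # 1, 2, 3, 4, 5, 6
```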

In [None]:
# Step 1: Installing NumPy
# If you don't have NumPy installed, you can install it using pip, the Python package installer. Open your terminal or command prompt and run the following command:

# pip install numpy



In [None]:
# Step 2: Importing NumPy
# Once NumPy is installed, you can import it in your Python script or interactive session using the `import` statement:

import numpy as np



In [None]:
# Step 3: Creating NumPy Arrays
# NumPy provides an array object that is similar to Python lists but with additional functionality. You can create a NumPy array by passing a Python list to the `np.array()` function. For example:

import numpy as np

# Create a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)

# Create a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
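
Besides `np.array()`, the creation functions mentioned earlier build arrays from a shape or a range instead of a list. A minimal sketch, with arbitrary shapes and ranges:

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
ones = np.ones(4)                 # 1D array of four 1.0 values
steps = np.arange(0, 10, 2)       # 0, 2, 4, 6, 8 (the stop value is excluded)
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points, both endpoints included

print(zeros.shape, ones.shape)
print(steps)
print(grid)
```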



In [None]:
# Step 4: Array Attributes
# NumPy arrays have several useful attributes that you can access. Some common attributes are:

# - `shape`: Returns the dimensions of the array.
# - `dtype`: Returns the data type of the array elements.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # Output: (2, 3)
print(arr.dtype)  # Output: int64 (int32 on some platforms, e.g. Windows with NumPy 1.x)



In [None]:
# Step 5: Array Indexing and Slicing
# You can access elements of a NumPy array using indexing and slicing, similar to Python lists.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Indexing
print(arr[0])  # Output: 1
print(arr[-1])  # Output: 5

# Slicing
print(arr[1:4])  # Output: [2 3 4]
print(arr[:3])  # Output: [1 2 3]
print(arr[3:])  # Output: [4 5]
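
The same ideas extend to 2D arrays, where each axis gets its own index or slice. The sketch below uses a hypothetical 3x3 array and also shows one difference from Python lists: a slice is a view, so writing to it changes the original array.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

element = grid[1, 2]      # row 1, column 2: 6
first_col = grid[:, 0]    # every row, column 0: 1, 4, 7
top_left = grid[:2, :2]   # first two rows and columns: [[1, 2], [4, 5]]
picked = grid[[0, 2]]     # fancy indexing: rows 0 and 2

# Slices are views, not copies: modifying the slice modifies the source array
base = np.arange(5)       # 0, 1, 2, 3, 4
window = base[1:4]
window[0] = 99            # base is now 0, 99, 2, 3, 4
```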



In [None]:
# Step 6: Random Number Generation and Seed Setting
# NumPy provides functions for generating random numbers. You can also set a seed value to make the random numbers reproducible.

import numpy as np

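# Generate random numbers (no seed has been set yet, so these values differ on every run)
rand_nums = np.random.random((2, 3))
print(rand_nums)

# Set a seed for reproducibility: the values generated after this call are the same on every run
np.random.seed(42)
rand_nums2 = np.random.random((2, 3))
print(rand_nums2)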
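
An alternative to the global seed is NumPy's `Generator` API, created with `np.random.default_rng()`, which keeps the random state in an object instead of in module-level state. A minimal sketch, assuming the seed value 42 only to mirror the cell above:

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.random((2, 3))   # uniform floats in [0, 1)
sample_b = rng_b.random((2, 3))

same = np.array_equal(sample_a, sample_b)
print(same)  # True
```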



In [None]:
# Step 7: Vector Operations
# NumPy allows you to perform vector operations on arrays, such as addition, subtraction, multiplication, and division.

import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Vector addition
arr_sum = arr1 + arr2
print(arr_sum)  # Output: [5 7 9]

# Vector subtraction
arr_diff = arr2 - arr1
print(arr_diff)  # Output: [3 3 3]

# Vector multiplication (element-wise)
arr_prod = arr1 * arr2
print(arr_prod)  # Output: [ 4 10 18]

# Vector division (element-wise, always produces floats)
arr_div = arr2 / arr1
print(arr_div)  # Output: [4.  2.5 2. ]
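
The operators above all work element by element. As a short supplementary sketch, the dot product and a universal function (`np.sqrt`) are shown on the same vectors; the third array is an illustrative addition.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Dot product: sum of the element-wise products (1*4 + 2*5 + 3*6)
dot = np.dot(arr1, arr2)          # 32

# Scalar arithmetic is applied to every element
scaled = arr1 * 2                 # 2, 4, 6

# Universal functions (ufuncs) such as np.sqrt also act element-wise
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))   # 1.0, 2.0, 3.0
```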

## Pandas

Pandas is a powerful open-source library in Python that provides high-performance data manipulation and analysis tools. It is built on top of NumPy and is widely used for data cleaning, preprocessing, exploration, and analysis tasks. Here are some key aspects of Pandas:

1. Data Structures: Pandas introduces two primary data structures: Series and DataFrame.

   - Series: A Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a one-dimensional array with labels.
   - DataFrame: A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is similar to a table or a spreadsheet, where each column represents a variable, and each row represents an observation.

2. Data Alignment: One of the powerful features of Pandas is its ability to automatically align data based on labels. This means you can perform operations on Series or DataFrame objects with different sizes, and Pandas will match values by label rather than by position. Labels that appear in only one of the operands produce missing values (`NaN`) in the result.

3. Importing and Exporting Data: Pandas provides various functions to read data from different file formats such as CSV, Excel, SQL databases, and more. It also allows you to export data to different formats.

4. Data Cleaning and Preprocessing: Pandas offers a wide range of functions for handling missing values, duplicate data, data imputation, data transformation, and handling outliers. It provides flexibility in manipulating and cleaning data to prepare it for analysis.

5. Data Exploration: Pandas provides functions for data exploration, including summary statistics, value counts, unique values, correlation analysis, and more. These functions help you understand the characteristics of your dataset and gain insights into the data.

6. Indexing and Selection: Pandas provides powerful indexing and selection capabilities. You can access, slice, and filter data based on row and column labels using various indexing techniques like label-based indexing (`loc`), integer-based indexing (`iloc`), and boolean indexing.

7. Data Manipulation: Pandas allows you to perform various data manipulation tasks such as merging, joining, reshaping, pivoting, and aggregating data. These operations enable you to transform the structure of your data to meet specific requirements.

8. Data Visualization: Pandas integrates well with popular data visualization libraries like Matplotlib and Seaborn, allowing you to create visually appealing plots, charts, and graphs directly from Pandas objects.

9. Time Series Analysis: Pandas provides specialized data structures and functions for working with time series data. It includes functionalities like resampling, time zone handling, date range generation, and shifting.

10. Integration with Other Libraries: Pandas integrates seamlessly with other libraries in the PyData ecosystem, such as NumPy, Matplotlib, Scikit-learn, and more. This integration allows for a streamlined workflow when performing data analysis, preprocessing, modeling, and visualization tasks.

Pandas is widely used in data analysis, data science, and machine learning workflows. Its powerful data manipulation and analysis capabilities, combined with its intuitive API, make it a go-to library for handling and exploring structured data in Python.
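
The two data structures and label-based alignment described above can be sketched with a few hand-made values. The labels and column names below are hypothetical.

```python
import pandas as pd

# Two Series whose labels only partly overlap
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels: 'b' and 'c' are summed, 'a' and 'd' become NaN
total = s1 + s2

# A DataFrame holds columns of different types that share one index
frame = pd.DataFrame({'name': ['x', 'y'], 'score': [1.5, 2.5]})

print(total)
print(frame.dtypes)
```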

In [None]:
# Step 1: Installing Pandas
# If you don't have pandas installed, you can install it using pip, the Python package installer. Open your terminal or command prompt and run the following command:

# pip install pandas



In [None]:
# Step 2: Importing Pandas
# Once pandas is installed, you can import it in your Python script or interactive session using the `import` statement:

import pandas as pd



In [10]:
# Step 3: Loading Data from CSV
# You can load data from a CSV file using the `pd.read_csv()` function. Specify the file path as the argument. For example:

import pandas as pd

# Load data from CSV
# df = pd.read_csv('data.csv')

# Alternatively, use a dataset bundled with scikit-learn.
# The Iris dataset contains four features (length and width of sepals and petals)
# for 50 samples of each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df['target'] = pd.Series(data=iris['target'])

df.head()


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
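
Since the cell above leaves the `pd.read_csv()` call commented out, here is a small sketch of it that runs without a file on disk. It assumes an in-memory CSV string (via `io.StringIO`) in place of `data.csv`; the column names and values are made up.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# to_csv with no path returns the CSV as a string; index=False omits the row labels
out = df_csv.to_csv(index=False)
```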


In [11]:
# Step 4: DataFrame Information
# Pandas provides several functions to get information about a DataFrame.

# - `df.info()`: Displays the basic information about the DataFrame, including column names, data types, and non-null values.
# - `df.describe()`: Generates descriptive statistics of the DataFrame, such as count, mean, min, max, etc.
# - `df.head()`: Displays the first few rows of the DataFrame.

import pandas as pd

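# DataFrame Information
df.info()      # prints its summary directly
df.describe()  # not displayed here: a notebook cell only shows the value of its last expression
df.head()      # displayed as the cell output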



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [13]:
# Step 5: DataFrame Merging and Concatenation
# Pandas allows you to merge or concatenate DataFrames.

# - `pd.merge()`: Performs database-style joins on DataFrames.
# - `pd.concat()`: Concatenates DataFrames along a particular axis.

# import pandas as pd

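# DataFrame Merging
# Joining df with itself on 'target' is a many-to-many join: each of the 50 rows of a class
# is paired with all 50 rows of the same class, giving 3 * 50 * 50 = 7500 rows.
merged_df = pd.merge(df, df, on='target')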

# DataFrame Concatenation
concat_df = pd.concat([df, df], axis=0)

merged_df

|   | sepal length (cm)_x | sepal width (cm)_x | petal length (cm)_x | petal width (cm)_x | target | sepal length (cm)_y | sepal width (cm)_y | petal length (cm)_y | petal width (cm)_y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7495 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 6.7 | 3.0 | 5.2 | 2.3 |
| 7496 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 6.3 | 2.5 | 5.0 | 1.9 |
| 7497 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 6.5 | 3.0 | 5.2 | 2.0 |
| 7498 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 6.2 | 3.4 | 5.4 | 2.3 |
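
A self-merge on a repeated key produces a large result, so the join behaviour is easier to see on two small frames with unique keys. The sketch below uses hypothetical frames `left` and `right` and shows how the `how` parameter changes which keys survive.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Inner join (the default): only keys present in both frames (2 and 3)
inner = pd.merge(left, right, on='key')

# Outer join: all keys from both frames (1 to 4), with NaN where a side has no match
outer = pd.merge(left, right, on='key', how='outer')

# Concatenation stacks rows; ignore_index=True renumbers the result from 0
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(inner)
print(outer)
print(stacked)
```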


In [None]:
# Step 6: Renaming Columns
# You can rename columns in a DataFrame using the `df.rename()` function.

# import pandas as pd

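# Renaming Columns
# 'old_name' and 'new_name' are placeholders; labels not found in the DataFrame are ignored by default.
df.rename(columns={'old_name': 'new_name'}, inplace=True)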



In [None]:
# Step 7: Applying Functions to DataFrame
# You can apply functions to DataFrames using `apply()`, which allows you to pass a custom function or a lambda function.

import pandas as pd

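# Applying Functions
def custom_function(x):
    # Example transformation: convert centimetres to millimetres
    return x * 10

df['new_column'] = df['sepal length (cm)'].apply(custom_function)
df['new_column'] = df['sepal length (cm)'].apply(lambda x: x + 1)  # overwrites the column created above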



In [None]:
# Step 8: Grouping Data
# You can group data in a DataFrame based on specific columns using `df.groupby()`.

import pandas as pd

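# Grouping Data
grouped_df = df.groupby('target')  # one group per species label (0, 1, 2)
grouped_df.mean()  # mean of every numeric column within each group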
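
A `groupby` object does nothing visible until an aggregation is applied to it. The following sketch uses a hypothetical `sales` table to show a single aggregation and several aggregations at once.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'amount': [10, 20, 30, 40],
})

# One aggregation: total amount per region (north 40, south 60)
totals = sales.groupby('region')['amount'].sum()

# Several aggregations at once: one column per statistic
summary = sales.groupby('region')['amount'].agg(['mean', 'max'])

print(totals)
print(summary)
```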



In [None]:
# Step 9: Indexing with .loc and .iloc
# Pandas provides two indexing methods for accessing data in a DataFrame.

# - `.loc[]`: Allows you to access data using labels.
# - `.iloc[]`: Allows you to access data using integer-based indexing.

import pandas as pd

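# Indexing with .loc and .iloc
df.loc[0:4, ['sepal length (cm)', 'target']]  # rows labelled 0 through 4 (end label included), columns by name
df.iloc[0:5, [0, 4]]  # first five rows (end position excluded), columns at positions 0 and 4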
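
The difference between labels and positions is easier to see when the index is not the default 0, 1, 2, ... range. A minimal sketch with a hypothetical `people` frame indexed by name:

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [30, 25, 35], 'city': ['Oslo', 'Rome', 'Lima']},
    index=['ann', 'bob', 'cy'],
)

by_label = people.loc['bob', 'age']      # label-based: 25
by_pos = people.iloc[1, 0]               # position-based, same cell: 25

label_slice = people.loc['ann':'bob']    # label slices include the end label: 2 rows
pos_slice = people.iloc[0:1]             # position slices exclude the end: 1 row

# Boolean indexing combined with a column label
older = people.loc[people['age'] > 28, 'city']   # Oslo and Lima
```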



In [None]:
# Step 10: Handling Missing Values
# Pandas provides functions to handle missing values in a DataFrame.

# - `df.isnull()`: Checks for missing values and returns a boolean mask.
# - `df.dropna()`: Drops rows or columns with missing values.
# - `df.fillna()`: Fills missing values with a specified value.

import pandas as pd

# Handling Missing Values
df.isnull()  # Returns a boolean mask of missing values
df.dropna()  # Returns a copy without the rows that contain missing values (axis=1 drops columns instead)
df.fillna(0)  # Returns a copy with missing values replaced by 0
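
The Iris data has no missing values, so the calls above leave it unchanged. The sketch below uses a small hypothetical frame that does contain `NaN` entries to show the effect of each function.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
})

missing_per_col = raw.isnull().sum()   # number of missing values per column: a 1, b 1
complete_rows = raw.dropna()           # only the last row has no missing value
filled_zero = raw.fillna(0)            # replace every NaN with 0
filled_mean = raw.fillna(raw.mean())   # replace NaN with its column mean (a: 2.0, b: 5.5)
```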



In [None]:
# Step 11: Sorting
# Pandas allows you to sort a DataFrame based on column values.

# - `df.sort_values()`: Sorts the DataFrame by specified column(s).
# - `df.sort_index()`: Sorts the DataFrame by index.

import pandas as pd

# Sorting (each call returns a new, sorted DataFrame; df itself is unchanged)
df.sort_values(by='sepal length (cm)')  # Sorts the DataFrame by a single column
df.sort_values(by=['target', 'petal length (cm)'])  # Sorts the DataFrame by multiple columns
df.sort_index()  # Sorts the DataFrame by index
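
To make the ordering visible, here is a short sketch on a hypothetical `scores` frame, including descending order and restoring the original order with `sort_index()`.

```python
import pandas as pd

scores = pd.DataFrame({
    'team': ['b', 'a', 'b', 'a'],
    'points': [3, 7, 9, 1],
})

# Descending sort by one column: points 9, 7, 3, 1
by_points = scores.sort_values(by='points', ascending=False)

# Sort by team first, then by points within each team
by_both = scores.sort_values(by=['team', 'points'])

# Sorting by index restores the original row order
restored = by_points.sort_index()
```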




In [None]:
# Step 12: Iteration
# Pandas supports iteration over rows or columns in a DataFrame using the `iterrows()` and `items()` methods.

# - `iterrows()`: Iterates over the rows as (index, Series) tuples.
# - `items()`: Iterates over the columns as (column name, Series) tuples. It replaces `iteritems()`, which was removed in pandas 2.0.

import pandas as pd

# Iteration (limited to the first three rows to keep the output short)
for index, row in df.head(3).iterrows():
    # Access row values using row['column_name']
    print(index, row['target'])

for column_name, column_data in df.items():
    # Access column values using column_data
    print(column_name, column_data.dtype)

# Remember, iteration can be slow in pandas, so it is recommended to use vectorized operations whenever possible for better performance.
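
As a sketch of that last point, the same per-row computation is written below once with `iterrows()` and once as a vectorized column operation. The `items` frame and its column names are hypothetical.

```python
import pandas as pd

items = pd.DataFrame({'price': [2.0, 3.5, 1.0], 'qty': [3, 2, 10]})

# Row-by-row loop: works, but runs Python code once per row
loop_totals = []
for _, row in items.iterrows():
    loop_totals.append(row['price'] * row['qty'])

# Vectorized version: one expression over whole columns, same result
items['total'] = items['price'] * items['qty']

print(loop_totals)
print(items)
```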