### Question 1:
Generate a dataset for linear regression with 1000 samples, 5 features, and a single target.

Visualize the data by plotting the target column against each feature column. Also plot the best fit line in each case.

Hint: search for how to obtain a regression line using NumPy.
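
As a minimal sketch of what the hint points at, the snippet below fits a straight line with `np.polyfit` and cross-checks it against a plain least-squares solve. The toy data (points around y = 3x + 2) and all variable names are assumptions made up for illustration; they are not part of the assignment.

```python
import numpy as np

# Hypothetical toy data: points scattered around the line y = 3x + 2.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Degree-1 polynomial fit. polyfit returns coefficients from the highest
# degree down, so the slope comes first and the intercept second.
slope, intercept = np.polyfit(x, y, 1)

# Cross-check: solve the same least-squares problem with a design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(slope_ls, intercept_ls), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"polyfit: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"lstsq:   slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
```

Both approaches minimize the same squared error, so the two printed lines should agree up to floating-point rounding.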

In [9]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression as mr
X,y = mr(n_samples=1000, n_features=5, noise=0)
print(X.shape)
print(y.shape)

def plot_data_fit(X, y, feature_name):
  m, b = np.polyfit(X, y, 1)  # Linear fit (degree 1)
  plt.scatter(X, y)
  plt.plot(X, m * X + b, color='red')
  plt.xlabel(feature_name)
  plt.ylabel("Target (y)")
  plt.title(f"Target vs Feature {feature_name} with Best Fit Line")
  plt.show()

for i in range(X.shape[1]):
  plot_data_fit(X[:, i], y, f"X{i+1}")


### Question 2:
Make a classification dataset of 1000 samples with 2 features, 2 classes and 2 clusters per class.
Plot the data.
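
With only 2 features, the default arguments of `make_classification` do not fit, which is why `n_redundant=0` has to be passed explicitly. The sketch below mirrors, to the best of my understanding, two argument checks that scikit-learn performs; `check_make_classification_args` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def check_make_classification_args(n_features, n_informative, n_redundant,
                                   n_repeated, n_classes, n_clusters_per_class):
    """Hypothetical helper mirroring two argument checks of make_classification."""
    problems = []
    # Informative, redundant and repeated features must all fit into n_features.
    if n_informative + n_redundant + n_repeated > n_features:
        problems.append("informative + redundant + repeated features exceed n_features")
    # Each cluster sits on a vertex of a hypercube of dimension n_informative,
    # so there must be at least as many vertices as clusters.
    if n_classes * n_clusters_per_class > 2 ** n_informative:
        problems.append("n_classes * n_clusters_per_class exceeds 2 ** n_informative")
    return problems

# scikit-learn defaults (n_informative=2, n_redundant=2, n_repeated=0) with n_features=2
defaults = check_make_classification_args(2, 2, 2, 0, 2, 2)
# Same call with n_redundant=0, as required for this question
fixed = check_make_classification_args(2, 2, 0, 0, 2, 2)

print(defaults)
print(fixed)
```

For this question, 2 classes with 2 clusters each need 4 cluster centers, which is exactly the 2 ** 2 vertices available with 2 informative features.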

In [None]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=2, n_classes=2, n_clusters_per_class=2, n_redundant= 0, n_repeated=0)

features = X
target = y

colors = ['blue' if label == 0 else 'red' for label in target]

plt.scatter(features[:, 0], features[:, 1], c=colors)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Classification Dataset with 2 Features and 2 Classes (2 Clusters per Class)")

plt.show()


### Question 3:
Make a clustering dataset with 2 features and 4 clusters.

In [None]:
from sklearn.datasets import make_blobs

n_samples = 1000
n_features = 2

X, y = make_blobs(n_samples=n_samples, n_features=n_features, centers=4, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Clustering Dataset with 2 Features and 4 Clusters")
plt.show()

### Question 4:
Go to the website https://www.worldometers.info/coronavirus/ and scrape the table containing COVID-19 infection and death data using requests and BeautifulSoup. Convert the table to a pandas DataFrame with the following columns: Country, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases.

*(<b>Optional Challenge:</b> Change the data type of the columns (Population through ActiveCases) to integer. To do that, you need to remove the commas and plus signs. You may need to use df.apply() and pd.to_numeric(). Take care of the values that are empty strings.)*
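
The cleaning step in the optional challenge can be tried offline first. This is a sketch under stated assumptions: the three rows below are invented sample values, not scraped data, and `to_nullable_int` is a hypothetical helper name. It uses the nullable `Int64` dtype so that empty strings can stay missing instead of forcing the column to float.

```python
import pandas as pd

# Hypothetical sample rows, made up for illustration (not real scraped values).
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "TotalCases": ["1,234,567", "89", ""],
    "NewCases": ["+1,024", "", "N/A"],
})

def to_nullable_int(col):
    """Strip commas and plus signs from a Series of strings, then convert to Int64."""
    cleaned = col.str.replace(r"[,+]", "", regex=True).str.strip()
    # errors="coerce" turns empty or non-numeric strings into missing values.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")

numeric_cols = ["TotalCases", "NewCases"]
# DataFrame.apply passes each column to the helper as a whole Series.
converted = df[numeric_cols].apply(to_nullable_int)
df = pd.concat([df.drop(columns=numeric_cols), converted], axis=1)

print(df.dtypes)
print(df)
```

Note that `DataFrame.apply` hands each column to the function as a Series, whereas `Series.apply` calls it once per cell, so string methods such as `.str.replace` only work in the first case.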

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

table = soup.find('table', id='main_table_countries_today')

headers = [th.text.strip() for th in table.find('thead').find_all('th')]

rows = table.find('tbody').find_all('tr')

data = []
for row in rows:
    cells = [td.text.strip() for td in row.find_all('td')]
    data.append(cells[1:9])


df = pd.DataFrame(data, columns=headers[1:9])

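# Convert the numeric columns from strings to integers.
def convert_to_int(col):
    """
    Converts a Pandas Series of strings to nullable integers, removing commas and
    plus signs. Empty or non-numeric strings become missing values (<NA>).
    """
    cleaned = col.str.replace(r'[,+]', '', regex=True).str.strip()
    return pd.to_numeric(cleaned, errors='coerce').astype('Int64')

# Every selected column after the country name holds counts. Pass each whole
# column (a Series) to the helper; Series.apply would call it once per cell,
# where .str is not available and the conversion silently does nothing.
for col in df.columns[1:]:
    df[col] = convert_to_int(df[col])
# Missing values are kept as <NA>; a bare df.dropna() would have no effect
# because it returns a new DataFrame instead of modifying df.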

print(df.to_string())

### Question 5:

Generate an imbalanced classification dataset using sklearn with 1000 samples, 2 features, 2 classes, and 1 cluster per class. One of the classes should contain only 5% of the total samples. Confirm this using either numpy or Counter. Plot the data.

Now oversample the minority class to 5 times its initial size using SMOTE. Verify the number. Plot the data.

Now undersample the majority class to 3 times the size of the minority class using RandomUnderSampler. Verify the number. Plot the data.

Reference : Last markdown cell of the examples.
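
The class balance can be confirmed with NumPy as well as with `Counter`, and the same counts give the target sizes for the two resampling steps. The sketch below uses a hypothetical label vector `y_demo` with an exact 950/50 split; a generated dataset may deviate slightly from 5%. It also assumes both ratios (5 times and 3 times) are taken relative to the original minority size, which the question leaves open.

```python
import numpy as np

# Hypothetical label vector standing in for y: 950 samples of class 0, 50 of class 1.
y_demo = np.array([0] * 950 + [1] * 50)

# np.unique with return_counts=True gives the sorted labels and their frequencies.
classes, counts = np.unique(y_demo, return_counts=True)
minority_class = int(classes[np.argmin(counts)])
majority_class = int(classes[np.argmax(counts)])
minority_count = int(counts.min())

minority_share = minority_count / y_demo.size
print(f"minority class {minority_class}: {minority_count} samples ({minority_share:.1%})")

# Target sizes as {class_label: desired_number_of_samples} dictionaries,
# the dict form accepted by the sampling_strategy argument in imbalanced-learn.
smote_strategy = {minority_class: 5 * minority_count}   # oversample minority to 5x
under_strategy = {majority_class: 3 * minority_count}   # undersample majority to 3x minority
print(smote_strategy, under_strategy)
```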

In [None]:
from sklearn.datasets import make_classification
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
import numpy as np

# Define dataset parameters
n_samples = 1000
n_features = 2
n_classes = 2
n_clusters_per_class = 1

# Generate the imbalanced dataset (majority class: 95%, minority class: 5%)
X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_redundant= 0, n_repeated=0,
                           weights=[0.95, 0.05], n_clusters_per_class=n_clusters_per_class, random_state=0)

# Verify class distribution using Counter
class_counts = Counter(y)
print("Original class distribution:", class_counts)
minority_class = min(class_counts, key=class_counts.get)
minority_count = class_counts[minority_class]

# Plot the imbalanced dataset
colors = ['blue' if label == 0 else 'red' for label in y]
plt.scatter(X[:, 0], X[:, 1], c=colors, label='Imbalanced Data')
plt.title("Imbalanced Classification Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Oversample the minority class using SMOTE (5 times)
oversample = SMOTE(sampling_strategy={minority_class: 5 * minority_count})
X_resampled, y_resampled = oversample.fit_resample(X, y)

# Verify oversampling result
resampled_counts = Counter(y_resampled)
print("Oversampled class distribution:", resampled_counts)

# Plot the oversampled data
colors = ['blue' if label == 0 else 'red' for label in y_resampled]
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=colors, label='Oversampled Data')
plt.title("Oversampled Classification Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Undersample the majority class using RandomUnderSampler (3 times minority class size)
majority_class = (set(range(n_classes)) - {minority_class}).pop()  # Find the majority class
majority_count = class_counts[majority_class]
undersample = RandomUnderSampler(sampling_strategy={majority_class: minority_count * 3})  # Undersample majority to 3x minority size
X_undersampled, y_undersampled = undersample.fit_resample(X, y)

# Verify undersampling result
undersampled_counts = Counter(y_undersampled)
print("Undersampled class distribution:", undersampled_counts)

# Plot the undersampled data
colors = ['blue' if label == 0 else 'red' for label in y_undersampled]
plt.scatter(X_undersampled[:, 0], X_undersampled[:, 1], c=colors, label='Undersampled Data')
plt.title("Undersampled Classification Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()


### Question 6:

Write Python code to perform data preprocessing on a dataset using the scikit-learn library. Follow the instructions below:

 * Load the dataset using the scikit-learn `load_iris` function.
 * Assign the feature data to a variable named `X` and the target data to a variable named `y`.
 * Create a pandas DataFrame called `df` using `X` as the data and the feature names obtained from the dataset.
 * Display the first 5 rows of the DataFrame `df`.
 * Check if there are any missing values in the DataFrame and handle them accordingly.
 * Split the data into training and testing sets using the `train_test_split` function from scikit-learn. Assign 70% of the data to the training set and the remaining 30% to the testing set.
 * Print the dimensions of the training set and testing set respectively.
 * Standardize the feature data in the training set using the `StandardScaler` from scikit-learn.
 * Apply the same scaling transformation on the testing set.
 * Print the first 5 rows of the standardized training set.
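
The last steps in the list above hinge on one detail: the mean and standard deviation are estimated on the training set only, and those same statistics are then applied to the test set. The NumPy sketch below shows that idea by hand. The random arrays and the `_demo` names are assumptions for illustration; `StandardScaler` does the equivalent with `fit_transform` on the training data and `transform` on the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 10 training rows and 4 test rows with 3 features each.
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(4, 3))

# "Fit": compute per-feature statistics from the training set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the training statistics to both sets.
X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std

print("train mean:", X_train_std.mean(axis=0).round(6))
print("train std: ", X_train_std.std(axis=0).round(6))
print("test mean: ", X_test_std.mean(axis=0).round(3))
```

The standardized training set has mean 0 and standard deviation 1 per feature by construction. The test set generally does not, and that is expected: refitting on the test set would hide the difference and leak test information into preprocessing.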

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

iris = load_iris()

X = iris.data
y = iris.target

feature_names = iris.feature_names
df = pd.DataFrame(X, columns=feature_names)

print(df.head())

if df.isnull().values.any():
    print("Missing values found!")
else:
    print("No missing values found!")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set dimensions: {X_train.shape}, {y_train.shape}")
print(f"Testing set dimensions: {X_test.shape}, {y_test.shape}")

scaler = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics on the test set.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFirst 5 rows of standardized training set:")
print(pd.DataFrame(X_train_scaled[:5], columns=feature_names))
