## **Machine Learning**

**INTERMEDIATE QUESTIONS:**

**Q-1.** Imagine you have a dataset with Instagram features such as
Username, Caption, Hashtag, Followers, Time_Since_posted, and Likes. Your task is
to build a model that predicts the number of likes and the time since posted,
using the remaining features as inputs.
Dataset: You can use this dataset for this question.

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset
df = pd.read_csv("instagram_reach.csv")

# Preprocessing
# Handle missing values
df = df.dropna()

# Feature selection/engineering
# Combine alphanumeric values in Caption and Hashtag columns
df['Text'] = df['Caption'].astype(str) + ' ' + df['Hashtags'].astype(str)

# Convert time since posted to a numeric representation (minutes)
def convert_time(time_str):
    # Substring checks match both singular and plural units ("1 hour", "2 hours")
    if 'minute' in time_str:
        return int(time_str.split()[0])
    elif 'hour' in time_str:
        return int(time_str.split()[0]) * 60
    elif 'day' in time_str:
        return int(time_str.split()[0]) * 60 * 24
    else:
        return 0  # Handle cases where time is not specified or unknown

df['Time_Since_Posted_Minutes'] = df['Time since posted'].map(convert_time)

# Select relevant features
features = ['Text', 'Followers']
target_likes = 'Likes'
target_time_since_posted = 'Time_Since_Posted_Minutes'

In [18]:
# Split the data into training and testing sets
X_train, X_test, y_train_likes, y_test_likes, y_train_time, y_test_time = train_test_split(
    df[features], df[target_likes], df[target_time_since_posted], test_size=0.2, random_state=42)

In [19]:
# Vectorize the text feature
vectorizer = CountVectorizer()
X_train_text = vectorizer.fit_transform(X_train['Text'])
X_test_text = vectorizer.transform(X_test['Text'])

In [20]:
# Combine vectorized text features with numerical features
X_train_combined = pd.concat([pd.DataFrame(X_train_text.toarray()), X_train['Followers'].reset_index(drop=True)], axis=1)
X_test_combined = pd.concat([pd.DataFrame(X_test_text.toarray()), X_test['Followers'].reset_index(drop=True)], axis=1)

In [21]:
# Convert column names to strings
X_train_combined.columns = X_train_combined.columns.astype(str)
X_test_combined.columns = X_test_combined.columns.astype(str)

In [22]:
# Model training for predicting likes
likes_model = LinearRegression()
likes_model.fit(X_train_combined, y_train_likes)

In [23]:
# Model training for predicting time since posted
time_model = LinearRegression()
time_model.fit(X_train_combined, y_train_time)

# Model evaluation
likes_predictions = likes_model.predict(X_test_combined)
likes_mse = mean_squared_error(y_test_likes, likes_predictions)
likes_mae = mean_absolute_error(y_test_likes, likes_predictions)

time_predictions = time_model.predict(X_test_combined)
time_mse = mean_squared_error(y_test_time, time_predictions)
time_mae = mean_absolute_error(y_test_time, time_predictions)

print("Likes Prediction:")
print("Mean Squared Error (MSE):", likes_mse)
print("Mean Absolute Error (MAE):", likes_mae)

print("Time Since Posted Prediction:")
print("Mean Squared Error (MSE):", time_mse)
print("Mean Absolute Error (MAE):", time_mae)

Likes Prediction:
Mean Squared Error (MSE): 2796.1039055933584
Mean Absolute Error (MAE): 41.64477151011242
Time Since Posted Prediction:
Mean Squared Error (MSE): 45809.67560112746
Mean Absolute Error (MAE): 140.4217466223045
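
MSE is expressed in squared units, so it is hard to compare directly with MAE. The sketch below, in plain Python with made-up numbers, shows how MSE, MAE, and RMSE (the square root of MSE) relate. The helper names `mse`, `mae`, and `rmse` are illustrative and are not part of the notebook.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Toy values (assumed for illustration, not taken from the Instagram dataset)
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 35]

print("MSE:", mse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Applied to the output above, the square root of the reported likes MSE is roughly 53 likes, which is on the same scale as the MAE of about 42.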


In [24]:
# Predict with new data

# Re-fit the models on the training data (equivalent to reusing the models trained above)
likes_model = LinearRegression()
likes_model.fit(X_train_combined, y_train_likes)

time_model = LinearRegression()
time_model.fit(X_train_combined, y_train_time)

# Create a DataFrame with the new values
new_data = pd.DataFrame({
    'Text': ["Dont forget to TURN ON notification"],
    'Followers': [151]
})

# Combine alphanumeric values in Caption and Hashtag columns
new_data['Text'] = new_data['Text'].astype(str) + ' ' + "#lifestyle#happiness#entrepreneurs#entrepreneurlife#business#working#founder#startup#money#magazine#moneymaker#startuplife#successful#passion#inspiredaily#hardwork#hardworkpaysoff#desire"

# Convert column names to strings
new_data.columns = new_data.columns.astype(str)

# Vectorize the text feature
new_text = vectorizer.transform(new_data['Text'])

# Combine vectorized text features with numerical features
new_data_combined = pd.concat([pd.DataFrame(new_text.toarray()), new_data['Followers'].reset_index(drop=True)], axis=1)

# Convert column names to strings
new_data_combined.columns = new_data_combined.columns.astype(str)

# Make predictions
likes_prediction = likes_model.predict(new_data_combined)
time_prediction = time_model.predict(new_data_combined)

print("Predicted Likes:", likes_prediction)
print("Predicted Time Since Posted (minutes):", time_prediction)


Predicted Likes: [36.79535908]
Predicted Time Since Posted (minutes): [125.20292986]


**Q-2.** Imagine you have a dataset with features such as Age, Gender, Height,
Weight, BMI, and Blood Pressure, and you have to classify people into classes such as
Normal, Overweight, Obesity, Underweight, and Extreme Obesity. Build models that
perform this classification using any 4 different classification algorithms.
Dataset: You can use this dataset for this question.

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load and Preprocess the Dataset
data = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
X = data.drop('NObeyesdad', axis=1)
y = data['NObeyesdad']


In [27]:
data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [28]:
# Encode categorical variables
categorical_cols = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE','SCC','CALC', 'MTRANS']
label_encoder = LabelEncoder()
for col in categorical_cols:
    X[col] = label_encoder.fit_transform(X[col])

#Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build and Train the Models

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Support Vector Machines (SVM)
svm = SVC()
svm.fit(X_train, y_train)

# Evaluate and Compare the Models
models = [('Logistic Regression', logreg), ('Decision Tree', dt), ('Random Forest', rf), ('SVM', svm)]
for name, model in models:
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred)
    print(f"Model: {name}")
    print(report)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: Logistic Regression
                     precision    recall  f1-score   support

Insufficient_Weight       0.76      0.87      0.81        86
      Normal_Weight       0.51      0.42      0.46        93
     Obesity_Type_I       0.59      0.51      0.55       102
    Obesity_Type_II       0.79      0.90      0.84        88
   Obesity_Type_III       0.93      0.99      0.96        98
 Overweight_Level_I       0.50      0.49      0.49        88
Overweight_Level_II       0.43      0.44      0.44        79

           accuracy                           0.66       634
          macro avg       0.65      0.66      0.65       634
       weighted avg       0.65      0.66      0.65       634

Model: Decision Tree
                     precision    recall  f1-score   support

Insufficient_Weight       0.90      0.95      0.93        86
      Normal_Weight       0.83      0.82      0.82        93
     Obesity_Type_I       0.97      0.91      0.94       102
    Obesity_Type_II       0.96   
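
The convergence warning printed before the reports suggests scaling the data. Below is a minimal sketch of standardization, the transformation `StandardScaler` applies by default: z = (x - mean) / population standard deviation. It is written in plain Python so it runs without dependencies. The helper name `standardize` is hypothetical, and the sample values are the Weight and Height entries from the five rows shown by `data.head()`.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Two columns on very different scales
weights_kg = [64.0, 56.0, 77.0, 87.0, 89.8]
heights_m = [1.62, 1.52, 1.80, 1.80, 1.78]

scaled_weights = standardize(weights_kg)
scaled_heights = standardize(heights_m)

print([round(z, 2) for z in scaled_weights])
print([round(z, 2) for z in scaled_heights])
```

In the notebook, the equivalent would presumably be to fit `StandardScaler` (already imported) on `X_train` only and apply it to `X_test`, or to raise `max_iter` on `LogisticRegression`, as the warning itself suggests.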

**Q-3.** Imagine you have a dataset containing different categories of data. You
need to build a model that finds the data most similar to a given item, using
any 4 different similarity algorithms.
Dataset: You can use this dataset for this question.

In [29]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances

# Load the dataset
data = pd.read_json("News_Category_Dataset_v3.json", lines=True)
data.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [30]:
data = data[['category', 'headline', 'short_description']].copy()  # copy to avoid SettingWithCopyWarning
data['text'] = data['headline'] + ' ' + data['short_description']

# Vectorize the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['text'])

# Function to find the most similar data using different similarity algorithms
def find_similar_data(query, top_n=5):
    # Vectorize the query
    query_vector = vectorizer.transform([query])

    # Calculate similarities using different algorithms
    similarity_scores = []
    algorithms = [cosine_similarity, euclidean_distances, manhattan_distances]
    for algorithm in algorithms:
        sim = algorithm(X, query_vector).flatten()
        similarity_scores.append(sim)

    # Combine similarities from different algorithms
    similarity_scores = sum(similarity_scores) / len(similarity_scores)

    # Find the indices of top similar data points
    top_indices = similarity_scores.argsort()[-top_n:][::-1]

    # Return the top similar data points
    similar_data = data.iloc[top_indices]

    return similar_data

In [38]:
# Find similar News
query = "President Joe Biden and first lady Jill Biden"
similar_data = find_similar_data(query)
similar_data

Unnamed: 0,category,headline,short_description,text
109802,WORLDPOST,Weekend Roundup: Laughing at God,The first principle of an open society is not ...,Weekend Roundup: Laughing at God The first pri...
66816,POLITICS,Sunday Roundup,This week the nation watched as the #NeverTrum...,Sunday Roundup This week the nation watched as...
63109,POLITICS,Sunday Roundup,"This week, the nation was reminded, in ways bo...","Sunday Roundup This week, the nation was remin..."
107893,POLITICS,Sunday Roundup,"This week began with ""The Horrible Call"" final...","Sunday Roundup This week began with ""The Horri..."
72892,POLITICS,Sunday Roundup,This week the GOP debate circus pulled into Mi...,Sunday Roundup This week the GOP debate circus...
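
One thing to note about `find_similar_data`: `cosine_similarity` returns larger values for closer texts, while `euclidean_distances` and `manhattan_distances` return smaller values for closer texts. Averaging the raw scores and keeping the largest values therefore tends to favor distant documents, which may be why the results above are dominated by "Sunday Roundup" posts. The sketch below, in plain Python on toy vectors, converts each distance to a similarity with 1 / (1 + d) before averaging and adds Jaccard similarity as a fourth measure. The conversion formula, the choice of Jaccard, and all function names are assumptions for illustration, not the notebook's method.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_dist(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_sim(a, b):
    """Overlap of the non-zero positions (which terms are present)."""
    present_a = {i for i, x in enumerate(a) if x}
    present_b = {i for i, y in enumerate(b) if y}
    union = present_a | present_b
    return len(present_a & present_b) / len(union) if union else 0.0

def combined_similarity(a, b):
    """Average four scores after putting them all on a higher-is-closer scale."""
    scores = [
        cosine_sim(a, b),
        1 / (1 + euclidean_dist(a, b)),   # distance -> similarity
        1 / (1 + manhattan_dist(a, b)),   # distance -> similarity
        jaccard_sim(a, b),
    ]
    return sum(scores) / len(scores)

def top_matches(query, documents, top_n=2):
    """Indices of the documents with the highest combined similarity."""
    ranked = sorted(range(len(documents)),
                    key=lambda i: combined_similarity(query, documents[i]),
                    reverse=True)
    return ranked[:top_n]

# Toy term-count vectors (assumed for illustration)
docs = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [1, 0, 1, 0],
]
query = [1, 0, 2, 0]
print(top_matches(query, docs))
```

With all four scores pointing in the same direction, the document identical to the query ranks first and the one sharing no terms ranks last.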


**Q-4.** Imagine you are working as a sales manager. You need to predict the Revenue,
predict whether that revenue falls on a weekend or not, and estimate the
Informational_Duration, using an ensemble learning algorithm.
Dataset: You can use this dataset for this question.

In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('online_shoppers_intention.csv')

# Preprocess the data
# Drop any rows with missing values
df.dropna(inplace=True)

# Encode categorical variables
df = pd.get_dummies(df)

# Split the data into features (X) and target variable (y)
X = df.drop(['Revenue', 'Weekend', 'Informational_Duration'], axis=1)
y_revenue = df['Revenue']
y_weekend = df['Weekend']
y_info_duration = df['Informational_Duration']

# Split the data into training and testing sets
X_train, X_test, y_revenue_train, y_revenue_test, y_weekend_train, y_weekend_test, y_info_duration_train, y_info_duration_test = train_test_split(
    X, y_revenue, y_weekend, y_info_duration, test_size=0.2, random_state=42)

# Train the Random Forest model for revenue prediction
revenue_model = RandomForestClassifier(n_estimators=100, random_state=42)
revenue_model.fit(X_train, y_revenue_train)

# Predict revenue on the test set
revenue_predictions = revenue_model.predict(X_test)

# Train the Random Forest model for weekend prediction
weekend_model = RandomForestClassifier(n_estimators=100, random_state=42)
weekend_model.fit(X_train, y_weekend_train)

# Predict weekend on the test set
weekend_predictions = weekend_model.predict(X_test)

# Train the Random Forest model for informational duration prediction
info_duration_model = RandomForestRegressor(n_estimators=100, random_state=42)
info_duration_model.fit(X_train, y_info_duration_train)

# Predict informational duration on the test set
info_duration_predictions = info_duration_model.predict(X_test)

# Evaluate the models
revenue_accuracy = accuracy_score(y_revenue_test, revenue_predictions)
weekend_accuracy = accuracy_score(y_weekend_test, weekend_predictions)

print("Revenue Accuracy:", revenue_accuracy)
print("Weekend Accuracy:", weekend_accuracy)

Revenue Accuracy: 0.8913219789132197
Weekend Accuracy: 0.7619626926196269
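
Accuracy alone can flatter a classifier when one class dominates the target. A quick sanity check is to compare the model against a baseline that always predicts the most frequent class. The sketch below uses made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not something defined in the notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Toy labels (assumed): 8 negatives and 2 positives
y_true = [False] * 8 + [True] * 2
y_pred = [False] * 7 + [True] + [True, False]

print("Model accuracy:", accuracy(y_true, y_pred))
print("Baseline accuracy:", majority_baseline_accuracy(y_true))
```

In the notebook, the same comparison could be made with `y_revenue_test.value_counts(normalize=True)`. The regression model for `Informational_Duration` is fitted and used for prediction above but never scored; a regression metric such as `mean_absolute_error` from `sklearn.metrics` would be one way to evaluate it.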


**Q-5.** Uber is a taxi service provider. Predict the high-booking areas using an
unsupervised algorithm, predict the price for a location using a supervised
algorithm, and use a map function to display the data.
Dataset: You can use this dataset for this question.

In [1]:
import pandas as pd
import numpy as np
import folium
from folium import plugins
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: Data preprocessing
data = pd.read_csv('rideshare_kaggle.csv')  # Load the dataset
# Perform data cleaning and preprocessing as needed

# Step 2: Unsupervised algorithm for high booking areas
# Select relevant columns for clustering
location_data = data[['latitude', 'longitude']]
# Perform clustering using K-means algorithm
kmeans = KMeans(n_clusters=5)  # Set the number of clusters as desired
clusters = kmeans.fit_predict(location_data)
# Add cluster labels to the dataset
data['cluster_label'] = clusters

# Step 3: Supervised algorithm for price prediction
# Select relevant features for price prediction
features = ['latitude', 'longitude', 'distance', 'surge_multiplier', 'temperature', 'humidity']
X = data[features]
y = data['price']
# Handle missing values in y
y = y.fillna(y.mean())  # Replace NaN values with the mean of y
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
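
The K-means step assigns each ride a cluster label but does not say which cluster is the high-booking area. One reading, assumed here, is that the busiest area is the cluster containing the most rides. A minimal sketch with made-up labels (the name `busiest_clusters` is hypothetical):

```python
from collections import Counter

def busiest_clusters(cluster_labels, top_n=1):
    """Return (label, ride_count) pairs for the most populated clusters."""
    return Counter(cluster_labels).most_common(top_n)

# Toy cluster assignments for ten rides (assumed for illustration)
labels = [0, 2, 2, 1, 2, 0, 2, 4, 3, 2]

print(busiest_clusters(labels))
print(busiest_clusters(labels, top_n=2))
```

On the DataFrame above, `data['cluster_label'].value_counts()` gives the same counts, and `kmeans.cluster_centers_` holds the coordinates of each cluster's centre for plotting.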



In [2]:
# Use a name other than `map` so the built-in function is not shadowed
booking_map = folium.Map(location=[data['latitude'].mean(), data['longitude'].mean()], zoom_start=10)

# Add markers for high booking areas (one marker per distinct location and cluster,
# rather than one per ride, to keep the map light)
locations = data[['latitude', 'longitude', 'cluster_label']].drop_duplicates()
for index, row in locations.iterrows():
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup=f"Cluster: {int(row['cluster_label'])}").add_to(booking_map)

# Add predicted prices as a heatmap
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
heat_data = data[['latitude', 'longitude', 'price']].copy()
heat_data['predicted_price'] = model.predict(data[features])
heat_data = heat_data.dropna(subset=['predicted_price'])

heat_data = heat_data.groupby(['latitude', 'longitude'])['predicted_price'].mean().reset_index().values.tolist()

folium.TileLayer('cartodbpositron').add_to(booking_map)  # Add tile layer for better visualization
plugins.HeatMap(heat_data).add_to(booking_map)

# Save the map to an HTML file
booking_map.save('booking_map.html')


**Q-6.** Imagine you have a dataset for predicting loan eligibility. Build a model
that predicts loan eligibility using any 4 different classification algorithms,
find the accuracy of the model, build it in Docker, and use a library to display
the results in a frontend.
Dataset: You can use this dataset for this question.

In [None]:
PRIVATE

**Q-7.** Imagine you have a dataset where you need to predict the genres of music
using an unsupervised algorithm. Find the accuracy of the model, build it in
Docker, and use a library to display the results in a frontend.
Dataset: You can use this dataset for this question.

In [3]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load the music dataset
data_1 = pd.read_csv('data.csv')
data_2 = pd.read_csv('data_2genre.csv')

data = pd.concat([data_1, data_2])



# Extract the features and drop irrelevant columns
features = data.drop(['filename', 'label'], axis=1)

# Scale the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Apply K-means clustering
kmeans = KMeans(n_clusters=5)  # Set the desired number of clusters
kmeans.fit(scaled_features)

# Get the predicted cluster labels
predicted_labels = kmeans.labels_

# Evaluate the clustering performance
silhouette = silhouette_score(scaled_features, predicted_labels)

# Add the predicted labels to the dataset
data['predicted_label'] = predicted_labels

# Display the clustering results
print('Clustering Results:')
print(data[['label', 'predicted_label']])

# Display the silhouette score
print('Silhouette Score:', silhouette)



Clustering Results:
     label  predicted_label
0    blues                3
1    blues                0
2    blues                4
3    blues                4
4    blues                1
..     ...              ...
195      2                0
196      2                0
197      2                4
198      2                0
199      2                0

[1200 rows x 2 columns]
Silhouette Score: 0.20448890546374038
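
The silhouette score measures how well separated the clusters are, but it does not use the genre labels, so it is not an accuracy. Because cluster IDs are arbitrary, one common way to get an accuracy-like figure is to map each cluster to its most frequent true label and count how many points agree (often called purity). A sketch with toy labels; `cluster_purity` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter, defaultdict

def cluster_purity(true_labels, cluster_ids):
    """Map each cluster to its majority true label; return (mapping, purity)."""
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)
    # Majority label per cluster
    mapping = {c: Counter(ls).most_common(1)[0][0] for c, ls in members.items()}
    # Fraction of points whose true label matches their cluster's majority label
    agree = sum(mapping[c] == l for l, c in zip(true_labels, cluster_ids))
    return mapping, agree / len(true_labels)

# Toy data (assumed for illustration)
genres = ["blues", "blues", "blues", "rock", "rock", "jazz", "jazz", "jazz"]
clusters = [0, 0, 1, 1, 1, 2, 2, 0]

mapping, purity = cluster_purity(genres, clusters)
print(mapping)
print("Purity:", purity)
```

Purity tends to rise as the number of clusters grows, so it is best read alongside the silhouette score. Note also that the output above mixes string genre names with numeric labels from `data_2genre.csv`, so the labels would need to be made consistent before such a comparison.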


**Q-8.** Quora question pair similarity: find the similarity between two
questions by mapping the words in the questions using TF-IDF, and then use a
supervised algorithm to predict the similarity between the questions.
Dataset: You can use this dataset for this question.

In [None]:
UNABLE TO DOWNLOAD DATA FROM KAGGLE
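
Although the dataset could not be downloaded, the approach described in the question can still be sketched. The example below builds TF-IDF vectors for a few made-up questions in plain Python and computes the cosine similarity of a pair. That score could then serve as one input feature for a supervised classifier trained on labeled duplicate pairs. The tokenizer, the smoothed IDF formula, the sample questions, and all function names are assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, drop question marks, split on whitespace."""
    return text.lower().replace("?", " ").split()

def build_idf(corpus):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def question_similarity(q1, q2, idf):
    return cosine(tfidf_vector(q1, idf), tfidf_vector(q2, idf))

questions = [
    "How do I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "How do I cook rice?",
]
idf = build_idf(questions)

sim_related = question_similarity(questions[0], questions[1], idf)
sim_unrelated = question_similarity(questions[1], questions[2], idf)
print(round(sim_related, 3), round(sim_unrelated, 3))
```

The first and third questions share "how do I" and so also score above zero, which is one reason a supervised model on top of the raw similarity is useful.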

**Q-9.** A cyber security agent wants to detect Microsoft malware and has come
to you, a machine learning engineer, with data. Detect the malware
using a supervised algorithm and find the accuracy of the model.
Dataset: You can use this dataset for this question.

In [None]:
UNABLE TO DOWNLOAD DATA FROM KAGGLE

**Q-10.** An ad agency analyzed a dataset of online ads and used a machine learning
model to predict whether a user would click on an ad or not.
Dataset: You can use this dataset for this question.

In [None]:
UNABLE TO DOWNLOAD DATA FROM KAGGLE