# CLASS SESSION GROUP ASSIGNMENT

> Use the provided dataset ([Social_Media_Usage.csv](Social_Media_Usage.csv)) which classifies the social media platforms used by males and females of different ages
> 
> Generate a machine learning model to predict platforms used by a 21yr old female and a 32yr old male.

#### Group 2 Members:
1. Kasasira Joshua
2. Racheal Econia
3. Nampijja Betty 
4. Nantaba Ziria Phionah 
5. Charles Jovans Galiwango
6. Daniel Ongom

### Import libraries

In [192]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

%matplotlib inline

### Read dataset and get information from data

In [193]:
# Load the dataset
df = pd.read_csv("Social_Media_Usage.csv")

In [194]:
## Check the variable Datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       18 non-null     int64 
 1   gender    18 non-null     object
 2   platform  18 non-null     object
dtypes: int64(1), object(2)
memory usage: 564.0+ bytes


In [208]:
## Check the variable Datatypes
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,18.0,28.055556,5.307843,20.0,25.0,28.0,31.0,37.0
gender,18.0,0.5,0.514496,0.0,0.0,0.5,1.0,1.0
platform,18.0,1.833333,1.098127,0.0,1.0,2.0,3.0,3.0


In [196]:
# Check for missing values
missing_info = df.isnull().sum()
missing_info

age         0
gender      0
platform    0
dtype: int64

### Encode the dataset

In [197]:
def encode_dataset(dataset):
  encoded_dataset = dataset
  
  # Encode categorical variables
  for column in dataset.select_dtypes(include=['object']).columns:
      encoded_dataset[column] = dataset[column].astype('category').cat.codes
      
  return encoded_dataset

In [198]:
# Encode the dataset
encoded_df = encode_dataset(df)
encoded_df

Unnamed: 0,age,gender,platform
0,20,0,2
1,23,0,2
2,25,0,2
3,26,0,1
4,29,0,1
5,30,0,1
6,31,0,3
7,33,0,3
8,37,0,3
9,20,1,2


### Splitting Dataset

In [199]:
X = encoded_df.drop('platform', axis=1) # Features (independent variables)
y = encoded_df['platform'] # Target variable (dependent variable)

# # Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Create the model

In [200]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [201]:
def map_back_to_labels(predictions):
  # Label Mapping
  label_mapping = {0: 'facebook', 1: 'snapchat', 2: 'tiktok', 3: 'twitter', }

  # Map the integer predictions to platform names
  interpreted_predictions = [label_mapping[label] for label in predictions]
  
  return interpreted_predictions

In [202]:
predictions = model.predict(X_test)
predictions

print(map_back_to_labels(predictions))

['tiktok', 'tiktok', 'facebook', 'snapchat']


### Model Evaluation

In [203]:
print("Accuracy:", accuracy_score(y_test, predictions))

Accuracy: 0.75


### Predict platforms used by a 21yr old female and a 32yr old male.

In [206]:
female_21_predictions = model.predict([[21, 0]])
male_32_predictions = model.predict([[32, 1]])

# Display the predictions
print("21yr old female will likely use:", map_back_to_labels(female_21_predictions))
print("32yr old male will likely use:", map_back_to_labels(male_32_predictions))
print("\n\n")

21yr old female will likely use: ['tiktok']
32yr old male will likely use: ['facebook']





