# Loading the Required Libraries

In [40]:
import pandas as pd
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

# Step 1: Load Data

In [2]:
# Load the data
data = pd.read_csv('gender_prediction_frm_txt.csv')

In [3]:
data.sample(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,gender,description
7475,7475,9736,female,"Life is about dreams, hopes, and goals. Mine i..."
9761,9761,12400,female,be happy
11490,11490,14394,unknown,#OpenDM / #Bi / #OpenRP / #YaoiRP / #OpenDM / ...
11583,11583,14502,male,Desperate for a creative bio
640,640,735,unknown,i hate @kyosukekiyo // according to a possible...
4418,4418,5167,female,Mayson's mommy 10.13.15 _ü In love with T....
15311,15311,18835,male,unbothered
8435,8435,10881,male,Proud National Socialist - Cis white male - Goy
10912,10912,13708,male,just one girl who loves justin bieber
8702,8702,11200,male,"Like our hands, our hearts."


# Step 2: Drop rows with NaN values in the text column

### Handling NaN Values in the Text Column

#### Definition of NaN Values
NaN stands for "Not a Number" and is used to represent missing or undefined values in a dataset. In the context of text data, NaN values indicate that a particular entry in the text column is missing or empty.

#### Purpose of Dropping Rows with NaN Values
Dropping rows with NaN values is an essential pre-processing step to ensure that the dataset is clean and complete. Working with incomplete data can lead to errors and unreliable results in text analysis and machine learning models. By removing rows with NaN values, we can:

1. **Ensure Data Quality**: Removing incomplete data entries helps maintain the integrity and quality of the dataset.
2. **Prevent Errors**: Many text processing functions and machine learning algorithms cannot handle NaN values and will raise errors if they encounter them.
3. **Improve Model Performance**: Clean and complete data contributes to more accurate and reliable model performance.
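
The second point can be reproduced without pandas. The sketch below assumes a missing text cell arrives as a float ``NaN`` (which is how pandas typically represents it in a column read from CSV) and uses a hypothetical helper named ``strip_non_letters``; it only illustrates the failure mode and one way to filter it out.

```python
import math
import re

def strip_non_letters(text):
    # Keep only ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)

# One entry is missing and is represented as the float NaN
descriptions = ["be happy", float("nan"), "Desperate for a creative bio"]

try:
    [strip_non_letters(d) for d in descriptions]
    error_name = None
except TypeError as exc:
    # re.sub only accepts strings (or bytes), so the float NaN is rejected
    error_name = type(exc).__name__

def is_missing(value):
    # True only for float NaN values
    return isinstance(value, float) and math.isnan(value)

# Filter the missing entry first, then clean the rest
kept = [d for d in descriptions if not is_missing(d)]
cleaned = [strip_non_letters(d) for d in kept]
print(error_name, cleaned)
```

Filtering before cleaning plays the same role here that ``dropna`` plays on the DataFrame.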


In [5]:
# Drop rows whose 'description' is NaN; reassign so later cells use the cleaned frame
data = data.dropna(subset=['description'])

In [6]:
print(len(data))

16224


In [7]:
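# Preview the data without the leftover index columns.
# The result is not assigned, so `data` itself is unchanged.
data.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1)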

Unnamed: 0,gender,description
0,male,i sing my own rhythm.
1,male,I'm the author of novels filled with family dr...
2,male,louis whining and squealing and all
3,male,"Mobile guy. 49ers, Shazam, Google, Kleiner Pe..."
4,female,Ricky Wilson The Best FRONTMAN/Kaiser Chiefs T...
...,...,...
16219,female,(rp)
16220,male,"Whatever you like, it's not a problem at all. ..."
16221,male,#TeamBarcelona ..You look lost so you should f...
16222,female,Anti-statist; I homeschool my kids. Aspiring t...


# Step 3: Data Cleaning Functions

We'll create three separate cleaning functions to preprocess the ``'description'`` column. <br> Each one handles a single aspect of text cleaning: removing symbols and numbers, converting text to lowercase, and removing stopwords. <br>We will also create a ``clean_text`` function that applies all three steps in sequence.

## Function 1: remove_symbols_numbers

This function removes every character that is not an ASCII letter or whitespace. Symbols, digits, and accented or non-Latin letters are all dropped; only ``a-z``, ``A-Z``, and whitespace are retained.

In [13]:
def remove_symbols_numbers(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)
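
To see what the pattern keeps and drops, here is a small check on a few strings (two are taken from the sample output above, the third is made up). ``collapse_whitespace`` is a hypothetical helper that is not part of this notebook; it is shown only because removing symbols can leave double or trailing spaces behind.

```python
import re

def remove_symbols_numbers(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

def collapse_whitespace(text):
    # Hypothetical helper: squeeze runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

samples = ["Mayson's mommy 10.13.15", "#OpenDM / #Bi", "café owner"]

stripped = [remove_symbols_numbers(s) for s in samples]
tidied = [collapse_whitespace(s) for s in stripped]

for before, after in zip(samples, tidied):
    print(repr(before), "->", repr(after))
```

Note that the accented letter in the third sample is removed as well, which may matter for non-English descriptions. Inside ``clean_text`` the extra spaces are harmless, because ``remove_stopwords`` splits on whitespace and rejoins with single spaces.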

## Function 2: to_lowercase

This function converts all characters in the input text to ``lowercase``.

In [14]:
def to_lowercase(text):
    return text.lower()

## Function 3: remove_stopwords

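This function removes common stopwords from the input text. Stopwords are frequently used words that carry little meaning on their own (e.g., ``"and"``, ``"the"``, ``"is"``). The membership test is case-sensitive and scikit-learn's ``ENGLISH_STOP_WORDS`` is all lowercase, so the text should be lowercased before this function is called.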

In [16]:
def remove_stopwords(text):
    words = text.split()
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

## Combined Function: clean_text

This function applies all three cleaning steps to the input text.

In [18]:
def clean_text(text):
    text = remove_symbols_numbers(text)
    text = to_lowercase(text)
    text = remove_stopwords(text)
    return text
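
The order of the three steps matters. The sketch below uses a tiny stand-in stopword set (``SMALL_STOP_WORDS``, an assumption made so the example has no dependencies) and a deliberately wrong variant, ``clean_text_wrong_order``, to show what happens when stopwords are removed before lowercasing.

```python
import re

# Stand-in for sklearn's ENGLISH_STOP_WORDS, which is also all lowercase
SMALL_STOP_WORDS = frozenset({"the", "of", "and", "is", "i"})

def remove_stopwords(text, stop_words=SMALL_STOP_WORDS):
    return ' '.join(w for w in text.split() if w not in stop_words)

def clean_text(text):
    # Same order as the notebook: symbols, lowercase, stopwords
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return remove_stopwords(text)

def clean_text_wrong_order(text):
    # Stopwords are removed while "The" is still capitalized
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = remove_stopwords(text)
    return text.lower()

raw = "The author of 3 novels"
right = clean_text(raw)
wrong = clean_text_wrong_order(raw)
print(right)
print(wrong)
```

In the wrong order, the capitalized "The" slips past the case-sensitive check and survives as "the" in the output.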

## Applying the Cleaning Functions

We will apply the clean_text function to the ``'description'`` column of the dataset.

In [19]:
# Apply the cleaning function to the 'description' column
data['description'] = data['description'].apply(clean_text)

### Select 50 male and 50 female instances

In [22]:
male_data = data[data['gender'] == 'male'].sample(n=50, random_state=42)
female_data = data[data['gender'] == 'female'].sample(n=50, random_state=42)
selected_data = pd.concat([male_data, female_data]).reset_index(drop=True)

# Step 4: Vectorization

After cleaning the text data, we convert the ``'description'`` column into numerical features using ``TF-IDF`` (Term Frequency-Inverse Document Frequency) vectorization. This step transforms the text data into a format suitable for machine learning models.

## TF-IDF Vectorization

This step converts the text into numerical features by calculating a ``TF-IDF`` score for each word in each description. TF-IDF gives more weight to words that are frequent in one document but rare across the other documents.

In [26]:
vectorizer = TfidfVectorizer()

# Apply TF-IDF on 'description' column
X = vectorizer.fit_transform(selected_data['description'])

# The target variable
y = selected_data['gender']

The result is a sparse matrix ``X`` where each row represents a description and each column represents a unique word in the corpus. The values in the matrix are the TF-IDF scores of the words.
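
To make the scores concrete, here is a hand-rolled sketch of the weighting on a three-document toy corpus. It assumes the formula scikit-learn documents for the ``TfidfVectorizer`` defaults (raw term counts, smoothed IDF ``ln((1 + n) / (1 + df)) + 1``, then L2 normalization of each row) and a plain whitespace tokenizer, which is simpler than the vectorizer's real tokenizer.

```python
import math
from collections import Counter

docs = ["happy mom", "happy coder", "happy happy music"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: how many documents contain each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]      # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))      # L2 norm of the row
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
first = dict(zip(vocab, matrix[0]))
print(vocab)
print({w: round(v, 3) for w, v in first.items()})
```

In the first document, "mom" ends up with a larger weight than "happy" because "happy" appears in every document and therefore has the lowest IDF.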

# Step 5: Model Training and Evaluation

We train and evaluate a machine learning model to predict the gender based on the ``TF-IDF`` vectors. Here, we use ``Logistic Regression`` as an example.

## Train-Test Split

This step splits the data into training and testing sets. The ``training set`` is used to train the model, and the ``testing set`` is used to evaluate the model's performance on unseen data. With ``test_size=0.2`` and 100 selected rows, 20 rows are held out for testing.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
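
A plain random split does not guarantee that the test set keeps the 50/50 class balance of the selected data. One option, shown here as a sketch on placeholder data (``X_demo`` and ``y_demo`` are made-up names, not the notebook's variables), is the ``stratify`` argument of ``train_test_split``.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the selection: 50 rows per class
X_demo = list(range(100))
y_demo = ['male'] * 50 + ['female'] * 50

# Plain random split: class counts in the test set can drift
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: the test set keeps the class proportions of y_demo
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("plain:     ", Counter(y_test_plain))
print("stratified:", Counter(y_test_strat))
```

With stratification, each class contributes 10 of the 20 test rows, so per-class metrics are computed on equal support.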

# Step 6: Model Training

This step trains the model on the training data and then predicts labels for the test set.

In [28]:
# Create the classifier with default settings
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Step 7: Model Evaluation

This step evaluates the predictions using overall accuracy and a per-class report of precision, recall, and F1-score.

In [29]:
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Logistic Regression Accuracy: 0.45
              precision    recall  f1-score   support

      female       0.42      1.00      0.59         8
        male       1.00      0.08      0.15        12

    accuracy                           0.45        20
   macro avg       0.71      0.54      0.37        20
weighted avg       0.77      0.45      0.33        20
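
The report above can be unpacked into raw counts. The sketch below starts from the printed support and recall values (which are rounded, so the counts are inferred rather than read from the model) and recomputes accuracy and precision from them.

```python
# Support and recall as printed in the report (recall is rounded)
true_female, true_male = 8, 12
female_recall, male_recall = 1.00, 0.08

# Recall = correctly predicted rows of a class / all rows of that class
female_as_female = round(true_female * female_recall)
male_as_male = round(true_male * male_recall)
female_as_male = true_female - female_as_female
male_as_female = true_male - male_as_male

# Accuracy = all correct predictions / all rows
accuracy = (female_as_female + male_as_male) / (true_female + true_male)

# Precision = correct predictions of a class / all predictions of that class
female_precision = female_as_female / (female_as_female + male_as_female)
male_precision = male_as_male / (male_as_male + female_as_male)

print("predicted female:", female_as_female + male_as_female)
print("predicted male:  ", male_as_male + female_as_male)
print(round(accuracy, 2), round(female_precision, 2), round(male_precision, 2))
```

The counts show that the model labels 19 of the 20 test rows as female, which explains the perfect female recall alongside the low accuracy.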



## Applying different models and comparing their performance

# Naive Bayes model

In [41]:
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_nb = nb_model.predict(X_test)

# Evaluate the model
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))


Naive Bayes Accuracy: 0.55
              precision    recall  f1-score   support

      female       0.47      0.88      0.61         8
        male       0.80      0.33      0.47        12

    accuracy                           0.55        20
   macro avg       0.63      0.60      0.54        20
weighted avg       0.67      0.55      0.53        20



# Decision Tree model

In [42]:
dt_model = DecisionTreeClassifier()

# Train the model
dt_model.fit(X_train, y_train)

# Predict on the test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate the model
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

Decision Tree Accuracy: 0.4
              precision    recall  f1-score   support

      female       0.39      0.88      0.54         8
        male       0.50      0.08      0.14        12

    accuracy                           0.40        20
   macro avg       0.44      0.48      0.34        20
weighted avg       0.46      0.40      0.30        20



# Random Forest model

In [43]:
rf_model = RandomForestClassifier()

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.4
              precision    recall  f1-score   support

      female       0.40      1.00      0.57         8
        male       0.00      0.00      0.00        12

    accuracy                           0.40        20
   macro avg       0.20      0.50      0.29        20
weighted avg       0.16      0.40      0.23        20





# SVM model

In [44]:
svm_model = SVC()

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test)

# Evaluate the model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.4
              precision    recall  f1-score   support

      female       0.40      1.00      0.57         8
        male       0.00      0.00      0.00        12

    accuracy                           0.40        20
   macro avg       0.20      0.50      0.29        20
weighted avg       0.16      0.40      0.23        20





# KNN (K Nearest Neighbors) model

In [45]:
knn_model = KNeighborsClassifier()

# Train the model
knn_model.fit(X_train, y_train)

# Predict on the test set
y_pred_knn = knn_model.predict(X_test)

# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))


KNN Accuracy: 0.7
              precision    recall  f1-score   support

      female       0.58      0.88      0.70         8
        male       0.88      0.58      0.70        12

    accuracy                           0.70        20
   macro avg       0.73      0.73      0.70        20
weighted avg       0.76      0.70      0.70        20



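## On this dataset, the KNN model gives the highest accuracy (0.70).
So we use this model for real-time prediction. Note that the test set has only 20 rows, so the ranking of the models may change with a different split.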
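
Since the comparison rests on a single small test split, a cross-validated score may give a steadier estimate. This is a minimal sketch on placeholder texts (the words and the ``texts``/``labels`` names are made up for illustration); in the notebook the inputs would be the cleaned descriptions and the gender labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Placeholder corpus: 10 rows per class, arbitrary words
texts = ["alpha beta gamma"] * 10 + ["delta epsilon zeta"] * 10
labels = ["male"] * 10 + ["female"] * 10

# Vectorizer and classifier are chained so that TF-IDF is refit
# on the training part of every fold
pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier())

# 5-fold cross-validation; one accuracy value per fold
scores = cross_val_score(pipeline, texts, labels, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```

Putting the vectorizer inside the pipeline keeps each fold's held-out rows out of the vocabulary and IDF statistics.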

In [46]:
# Save the best model KNN
joblib.dump(knn_model, 'best_gender_prediction_model_knn.pkl')

# Save the TF-IDF vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')


['tfidf_vectorizer.pkl']
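
An alternative to saving two files is to bundle the vectorizer and the classifier in one scikit-learn pipeline, so they cannot get out of sync. The sketch below uses placeholder training texts, ``n_neighbors=1`` (chosen only because the toy corpus is tiny), and a hypothetical file name, ``gender_pipeline.pkl``, written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Placeholder training data with arbitrary words
texts = ["alpha beta gamma", "delta epsilon zeta",
         "gamma and beta", "zeta and epsilon"]
labels = ["male", "female", "male", "female"]

pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gender_pipeline.pkl")  # hypothetical name
    joblib.dump(pipeline, path)      # one file holds both steps
    restored = joblib.load(path)

# The restored pipeline accepts raw strings directly
restored_predictions = list(restored.predict(["beta and gamma"]))
print(restored_predictions)
```

With a single artifact, the prediction code only needs one ``joblib.load`` call and no separate ``transform`` step.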

## Load the Model and Vectorizer for Real-Time Predictions

Now, let's write the code to load the saved model and vectorizer, and then get real-time predictions based on user input:

In [49]:
import joblib

# Load the trained model
model = joblib.load('best_gender_prediction_model_knn.pkl')

# Load the TF-IDF vectorizer
vectorizer = joblib.load('tfidf_vectorizer.pkl')

In [50]:
def predict_gender(description):
    # Clean the input description
    cleaned_description = clean_text(description)

    # Transform the input description using the trained TF-IDF vectorizer
    description_tfidf = vectorizer.transform([cleaned_description])

    # Predict gender using the trained model
    prediction = model.predict(description_tfidf)

    return prediction[0]
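
One edge case worth handling: if none of the words in the input were seen during training, the TF-IDF vector is all zeros and the prediction carries no information. The sketch below uses a small placeholder model (``demo_vectorizer``, ``demo_model``, and ``predict_or_none`` are made-up names) and returns ``None`` in that situation; the notebook's ``clean_text`` step is left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data with arbitrary words
train_texts = ["alpha beta gamma", "delta epsilon zeta",
               "alpha gamma", "delta zeta"]
train_labels = ["male", "female", "male", "female"]

demo_vectorizer = TfidfVectorizer()
demo_model = KNeighborsClassifier(n_neighbors=1)
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

def predict_or_none(description):
    features = demo_vectorizer.transform([description])
    # nnz counts stored non-zero entries; 0 means no known word was found
    if features.nnz == 0:
        return None
    return demo_model.predict(features)[0]

known = predict_or_none("alpha beta")
unknown = predict_or_none("qqq www")
print(known, unknown)
```

Returning ``None`` (or a label such as "unknown") lets the caller decide how to report inputs the model cannot judge.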

In [51]:
# Get input description from the user
new_description = input("Enter a description: ")

# Predict the gender
predicted_gender = predict_gender(new_description)
print("Predicted Gender:", predicted_gender)


Predicted Gender: male


## ---- Happy Learning & Always Keep Smiling ----

# ---- Jazakallah Khair ----