## __Unveiling Insights: Analyzing Trends and Patterns Across Diverse News Categories__

<center>
  <img src="https://cdn.pixabay.com/photo/2018/06/21/16/32/newspapers-3488861_1280.jpg" width="500"/>
</center>


# Table of Contents
1. [Introduction](#1.Introduction)
   - 1.1. [Problem Statement](#1.1.Problem-Statement)
   - 1.2. [Objectives](#1.2.Objectives)
2. [Importing Packages](#2.Importing-Packages)  
3. [Data loading and Inspection](#3.Data-loading-and-Inspection)    
4. [Data Cleaning](#4.Data-Cleaning)  
5. [Exploratory Data Analysis(EDA)](#5.Exploratory-Data-Analysis(EDA))
6. [Data Preprocessing](#6.Data-Preprocessing)
7. [Model development](#7.Model-development)
8. [Model Evaluation](#8.Model-Evaluation)
9. [Model Deployment](#9.Model-Deployment)
10. [Conclusion and Recommendations](#10.Conclusion-and-Recommendations)
11. [References](#11.References)

### **1.Introduction**

#### *1.1.Problem Statement*

#### *1.2.Objectives*

### **2.Importing Packages**

#### *2.1.Basic packages*

In [2]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

#### *2.2.Model development packages*

#### *2.3.Model evaluation packages*

### **3.Data loading and Inspection**

#### ***3.1.Data loading***

In [3]:
# URLs of the train and test CSV files
train_url = "https://github.com/DareSandtech/2407FTDS_Classification_Project/raw/refs/heads/main/Data/processed/train.csv"
test_url = "https://github.com/DareSandtech/2407FTDS_Classification_Project/raw/refs/heads/main/Data/processed/test.csv"

# load train data
train_df = pd.read_csv(train_url)

# load test data
test_df = pd.read_csv(test_url)

#### ***3.2.Data Inspection***

##### *3.2.1.Data Overview*

We begin by checking the shape of each dataset and previewing its first five rows to understand how the data is structured.

In [4]:
# Check the shape of the DataFrame
print("Shape of the train dataset:", train_df.shape)
print("Shape of the test dataset:", test_df.shape)

Shape of the train dataset: (5520, 5)
Shape of the test dataset: (2000, 5)


In [5]:
# Display the first 5 rows of the train_df to get a quick overview of the data
train_df.head()

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business


In [6]:
# Display the first 5 rows of test_df to get a quick overview of the data
test_df.head()

Unnamed: 0,headlines,description,content,url,category
0,NLC India wins contract for power supply to Ra...,State-owned firm NLC India Ltd (NLCIL) on Mond...,State-owned firm NLC India Ltd (NLCIL) on Mond...,https://indianexpress.com/article/business/com...,business
1,SBI Clerk prelims exams dates announced; admit...,SBI Clerk Prelims Exam: The SBI Clerk prelims ...,SBI Clerk Prelims Exam: The State Bank of Indi...,https://indianexpress.com/article/education/sb...,education
2,"Golden Globes: Michelle Yeoh, Will Ferrell, An...","Barbie is the top nominee this year, followed ...","Michelle Yeoh, Will Ferrell, Angela Bassett an...",https://indianexpress.com/article/entertainmen...,entertainment
3,"OnePlus Nord 3 at Rs 27,999 as part of new pri...",New deal makes the OnePlus Nord 3 an easy purc...,"In our review of the OnePlus Nord 3 5G, we pra...",https://indianexpress.com/article/technology/t...,technology
4,Adani family’s partners used ‘opaque’ funds to...,Citing review of files from multiple tax haven...,Millions of dollars were invested in some publ...,https://indianexpress.com/article/business/ada...,business


##### *3.2.2.Check for column naming conventions*

In [7]:
# Function to verify whether column names follow the expected naming convention
def check_col_naming_convention(df):
    """
    This function checks if the column names in the given DataFrame follow the 
    Capitalized Snake Case convention (e.g., 'Order_Date', 'Customer_Name').
    
    Args:
    df (pandas.DataFrame): The DataFrame to check.
    
    Returns:
    dict: A dictionary containing two lists:
          - 'compliant': Columns that follow Capitalized Snake Case.
          - 'non_compliant': Columns that do not follow Capitalized Snake Case.
    """
    
    # Regular expression to check for Capitalized Snake Case
    def is_capitalized_snake_case(col):
        return bool(re.match(r'^[A-Z][a-z0-9]*(_[A-Z][a-z0-9]*)*$', col))
    
    # Check all columns
    compliant_columns = [col for col in df.columns if is_capitalized_snake_case(col)]
    non_compliant_columns = [col for col in df.columns if not is_capitalized_snake_case(col)]
    
    return {
        'compliant': compliant_columns,
        'non_compliant': non_compliant_columns
    }

In [8]:
# Apply check_col_naming_convention() to both datasets to see which column names are non-compliant
print('Check column naming conventions for train_df')
print(check_col_naming_convention(train_df))
print()
print('Check column naming conventions for test_df')
print(check_col_naming_convention(test_df))

Check column naming conventions for train_df
{'compliant': [], 'non_compliant': ['headlines', 'description', 'content', 'url', 'category']}

Check column naming conventions for test_df
{'compliant': [], 'non_compliant': ['headlines', 'description', 'content', 'url', 'category']}


- None of the column names are compliant: all five are entirely lowercase, whereas the function checks for Capitalized Snake Case (e.g., `Order_Date`).
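To make the rule concrete, the short sketch below runs the same regular expression used in `check_col_naming_convention()` against a handful of column names. The sample names are hypothetical and chosen only for illustration; they are not taken from the datasets.

```python
import re

# Same pattern as in check_col_naming_convention() above
PATTERN = re.compile(r'^[A-Z][a-z0-9]*(_[A-Z][a-z0-9]*)*$')

# Hypothetical column names, used only to illustrate what the pattern accepts
samples = ['headlines', 'Headlines', 'Order_Date', 'order_date', 'OrderDate', 'URL']

# Map each name to True (compliant) or False (non-compliant)
results = {name: bool(PATTERN.match(name)) for name in samples}

for name, ok in results.items():
    print(f"{name!r}: {'compliant' if ok else 'non-compliant'}")
```

Only `Headlines` and `Order_Date` pass. Each underscore-separated word must start with exactly one uppercase letter followed by lowercase letters or digits, which is why `OrderDate` and `URL` are rejected along with the lowercase names.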

##### *3.2.3.Check for missing entries*

In [9]:
def check_missing_vals(df):
    """
    Checks for missing (null) values in each column of the provided DataFrame.
    
    Parameters:
    df (DataFrame): The pandas DataFrame to check for missing values.
    
    Returns:
    Series: A pandas Series where each element represents the count of missing 
            values in the corresponding column of the DataFrame.
    """
    # Count the null entries in each column
    return df.isnull().sum()

# Printing the missing entries for the training dataset (train_df)
print('Train_df\n\n', check_missing_vals(train_df))

print('_' * 20)

# Printing the missing entries for the test dataset (test_df)
print('Test_df\n\n', check_missing_vals(test_df))

Train_df

 headlines      0
description    0
content        0
url            0
category       0
dtype: int64
____________________
Test_df

 headlines      0
description    0
content        0
url            0
category       0
dtype: int64


- Neither dataset contains missing entries: every column in both `train_df` and `test_df` has a null count of 0.
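Note that `isnull()` only detects null/NaN values. For text data, an empty or whitespace-only string is also effectively missing but is not counted. The sketch below shows one possible complementary check; the helper `count_blank_strings` and the toy DataFrame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def count_blank_strings(df):
    """Count entries per column that are empty or whitespace-only strings.

    Hypothetical helper: complements isnull(), which does not flag '' or '   '.
    """
    # Cast to str so the check also runs on non-text columns without errors
    return df.apply(lambda col: col.astype(str).str.strip().eq("").sum())

# Toy data for illustration only (not taken from train_df or test_df)
toy = pd.DataFrame({
    "headlines": ["A headline", "   ", ""],
    "category": ["business", "sports", "education"],
})

blank_counts = count_blank_strings(toy)
print(blank_counts)
```

If it suits the project, the same helper could be called on `train_df` and `test_df` alongside `check_missing_vals()`.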

##### *3.2.4.Check for duplicate rows*

In [11]:
def check_dup(df):
    """
    Checks for duplicate rows in the provided DataFrame.
    
    Parameters:
    df (DataFrame): The pandas DataFrame to check for duplicate rows.
    
    Returns:
    int: The total number of duplicate rows in the DataFrame.
    """
    # Count rows that are exact duplicates of an earlier row
    return df.duplicated().sum()

# Printing the duplicate entries for the training dataset (train_df)
print('Train_df\n\n', check_dup(train_df))

print('_' * 20)

# Printing the duplicate entries for the test dataset (test_df)
print('Test_df\n\n', check_dup(test_df))

Train_df

 0
____________________
Test_df

 0


- Neither dataset contains duplicate rows: `duplicated()` flags 0 fully repeated records in both `train_df` and `test_df`.
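`duplicated()` with no arguments only flags rows that match an earlier row in every column. Two records could still describe the same article if, for example, they share a URL but differ slightly in another field. The sketch below illustrates a key-based check on toy data; the helper `count_duplicates_on` and the choice of `url` as the key are assumptions for illustration, not something the notebook establishes.

```python
import pandas as pd

def count_duplicates_on(df, subset):
    """Count rows that repeat an earlier row on the given columns only.

    Hypothetical helper built on DataFrame.duplicated(subset=...).
    """
    return int(df.duplicated(subset=subset).sum())

# Toy data for illustration only (not taken from train_df or test_df)
toy = pd.DataFrame({
    "headlines": ["Story A", "Story A (updated)", "Story B"],
    "url": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
})

full_row_dups = int(toy.duplicated().sum())    # compares every column
url_dups = count_duplicates_on(toy, ["url"])   # compares the url column only

print("Full-row duplicates:", full_row_dups)
print("Duplicates on url:", url_dups)
```

On this toy data the full-row check reports no duplicates while the `url`-based check reports one, which is the kind of case a whole-row comparison would miss.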

#### __Observations__

- None of the columns follow the Capitalized Snake Case convention checked in 3.2.2; all five column names are lowercase.
- Neither dataset contains missing (null) entries.
- Neither dataset contains duplicate rows.


The column-naming inconsistency will be handled in the data cleaning section so that the datasets conform to the conventions used for analysis and modeling.
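As a rough idea of what the renaming step could look like, here is a minimal sketch that converts a column name to Capitalized Snake Case. The helper `to_capitalized_snake_case` is hypothetical and assumes each word in a name starts with a letter; the cleaning section may well take a different approach.

```python
import re

def to_capitalized_snake_case(name):
    """Convert a column name to Capitalized Snake Case (e.g. 'order date' -> 'Order_Date').

    Hypothetical helper; assumes each word starts with a letter.
    """
    # Split on runs of non-alphanumeric characters (spaces, underscores, hyphens, ...)
    parts = [p for p in re.split(r'[^A-Za-z0-9]+', name.strip()) if p]
    # Capitalize the first character of each word and lowercase the rest
    return '_'.join(p[0].upper() + p[1:].lower() for p in parts)

# The column names reported by the naming-convention check above
columns = ['headlines', 'description', 'content', 'url', 'category']
renamed = [to_capitalized_snake_case(c) for c in columns]
print(renamed)
```

Because pandas' `DataFrame.rename` accepts a function for its `columns` argument, a helper like this could be applied with `train_df.rename(columns=to_capitalized_snake_case)`, and likewise for `test_df`.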


### **4.Data Cleaning**

### **5.Exploratory Data Analysis(EDA)**

### **6.Data Preprocessing**

### **7.Model development**

### **8.Model Evaluation**

### **9.Model Deployment**

### **10.Conclusion and Recommendations**

### **11.References**