<a href="https://colab.research.google.com/github/pablillo77/nlp_and_deep_learning/blob/main/DS_NLP_DeepLearning_Final_Pablo_Gim%C3%A9nez.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Verdana;text-align:center;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b> 🎬 Introduction 🎬</b></div>

<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Verdana;text-align:center;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>🧹 Preliminary data cleaning 🧹</b></div>

## Presentation ##
This project uses a Kaggle dataset of BBC news articles. The objective is to train a model that accurately classifies articles into five categories. Such a classifier could support commercial applications, for example by personalizing user engagement and providing insights about target audiences.

## Audience ##

The intended audience includes both readers seeking relevant content and businesses interested in market research, which could use categorized news for industry insights.

## Commercial Context ##

Personalized content delivery could improve user satisfaction and increase time spent on the platform. The classified data could also be valuable to marketers and advertisers for audience targeting.

## Key Hypotheses and Questions ##

- Effective categorization and personalized news recommendations enhance reader engagement and retention.
- How are the categories distributed?
- Can we predict categories in new content?
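
The distribution question can be explored with a simple frequency count. The following is a minimal sketch, assuming the labels are available as a plain list of strings; the sample list and the helper name `category_distribution` are hypothetical and only stand in for the dataset's `labels` column.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's `labels` column
sample_labels = ["tech", "politics", "sport", "tech", "business", "sport", "tech"]

def category_distribution(labels):
    """Return (category, count, share) tuples sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]

distribution = category_distribution(sample_labels)
for cat, n, share in distribution:
    # One row per category: name, absolute count, relative share
    print(f"{cat:<10} {n:>3} {share:6.1%}")
```

On the loaded DataFrame, `df['labels'].value_counts()` gives the same counts directly. A strongly imbalanced distribution would suggest looking at per-class metrics rather than accuracy alone.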

## Objectives ##

- Produce an accurate classification model using an LSTM and/or another RNN architecture.
- Acquire category-based insights.
- Lay the groundwork for future work on predicting trending categories.
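
An LSTM consumes fixed-length sequences of integers, so articles and labels have to be encoded before training. The sketch below illustrates that idea in plain Python. It is not the notebook's actual pipeline: the helpers `build_vocab` and `encode` and the two example texts are hypothetical, and the tokenization is only a lowercase whitespace split. It mimics the default behaviour of Keras `Tokenizer` and `pad_sequences` (ids start at 1 by frequency, 0 is reserved for padding, padding and truncation happen at the front).

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab, maxlen):
    """Convert text to ids (unknown words are skipped), then front-pad or front-truncate."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    if len(ids) > maxlen:
        ids = ids[len(ids) - maxlen:]          # keep only the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids     # pad with zeros at the front

# Hypothetical mini-corpus, for illustration only
texts = ["markets rally as shares rise", "team wins the cup final"]
labels = ["business", "sport"]

vocab = build_vocab(texts)
label_to_id = {label: i for i, label in enumerate(sorted(set(labels)))}

padded = encode("shares rise again", vocab, maxlen=5)
truncated = encode(texts[0], vocab, maxlen=3)
```

Reserving id 0 for padding lets an embedding layer treat padded positions differently from real words, which is why the vocabulary starts at 1.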





# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Verdana;text-align:center;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>⏳ Library imports and data loading ⏳</b></div>

In [None]:
# Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import seaborn as sns
import re
import string
import random
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from collections import Counter
from wordcloud import WordCloud
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Download necessary NLTK data

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Load the dataset from GitHub

# Source notebook on Kaggle: https://www.kaggle.com/code/dnkumars/lstm-model-bbc-articles-dataset/notebook?select=bbc_text_cls.csv
url = 'https://raw.githubusercontent.com/pablillo77/nlp_and_deep_learning/refs/heads/main/bbc_text_cls.csv'
df = pd.read_csv(url)
df.shape

(2225, 2)

In [None]:
# Show full article text instead of truncating long cells
pd.set_option('display.max_colwidth', None)
df.sample(2)

Unnamed: 0,text,labels
1072,"Plaid MP's cottage arson claim\n\nA Plaid Cymru MP believes UK security services were involved in some arson attacks blamed on Welsh extremists.\n\nIt is 25 years since the start of 12 years of fire-bombings, attributed to a shadowy group known as Meibion Glyndwr. Plaid Cymru's Elfyn Llwyd has suggested the security services could have been involved, with the intention of discrediting the nationalist vote. Ex-Welsh Office Minister Lord Roberts of Conwy denied security services were involved. In March this year, North Wales Police reopened the case, saying materials kept during their investigations would be examined to find whether it would yield DNA evidence.\n\nMeibion Glyndwr - which means ""sons of Glyndwr"" - began burning property in December 1979 in protest at homes in rural Wales being sold as holiday cottages to people from England. The group was linked to most of the 220 or so fire-bombing incidents stretching from the Llyn Peninsula to Pembrokeshire. The campaign continued until the early 1990s. Police were accused in some quarters of targeting anyone who was a nationalist. Although one man, Sion Aubrey Roberts, was convicted in 1993 of sending letter bombs in the post, the arson cases remain unsolved.\n\nAs a solicitor, Elfyn Llwyd represented Welsh singer Bryn F&#244;n when he was arrested on suspicion of being involved in the arson campaign. F&#244;n was released without charge . But now, as MP for Merionnydd Nant Conwy and Plaid Cymru's Parliamentary Leader, Mr Llwyd has argued that some of the terror attacks may have had the involvement of the security services and not Meibion Glyndwr. He believes that elements of the British security services may have carried out renegade actions in order to discredit Plaid Cymru and the nationalist vote ahead of elections. 
The claim is made in an interview for BBC Wales' Maniffesto programme to be shown on S4C on Sunday.\n\nMr Llwyd said that the sophistication of many of the devices used in the attacks compared to the crude nature of many others, suggests a degree of professionalism which could only have come from individuals who knew exactly what they were doing. He said: ""What I'm saying is that the role that they took wasn't the appropriate one, i.e. like an\n\nagent provocateur\n\nand perhaps interfering and creating a situation where it looked like it was the nationalists that were responsible."" The programme also heard from Lord Roberts of Conwy, who was a Welsh Office minister at the time. He denied that the security services played any improper role. Mr Llwyd's theory has also been questioned by Plaid Cymru's former President, Dafydd Wigley. He accepted that the fires damaged Plaid Cymru's public image but believed that the security services had their hands full at the time with the IRA and animal rights activists.\n\n\n - Maniffesto can be seen on S4C on Sunday, 12 December, at 1200 GMT.",politics
1991,"A decade of good website design\n\nThe web looks very different today than it did 10 years ago.\n\nBack in 1994, Yahoo had only just launched, most websites were text-based and Amazon, Google and eBay had yet to appear. But, says usability guru Dr Jakob Nielsen, some things have stayed constant in that decade, namely the principles of what makes a site easy to use. Dr Nielsen has looked back at a decade of work on usability and considered whether the 34 core guidelines drawn up back then are relevant to the web of today. ""Roughly 80% of the things we found 10 years ago are still an issue today,"" he said. ""Some have gone away because users have changed and 10% have changed because technology has changed.""\n\nSome design crimes, such as splash screens that get between a user and the site they are trying to visit, and web designers indulging their artistic urges have almost disappeared, said Dr Nielsen.\n\n""But there's great stability on usability concerns,"" he told the BBC News website. Dr Nielsen said the basic principles of usability, centring around ease of use and clear thinking about a site's total design, were as important as ever. ""It's necessary to be aware of these things as issues because they remain as such,"" he said. They are still important because the net has not changed as much as people thought it would. ""A lot of people thought that design and usability was only a temporary problem because broadband was taking off,"" he said. ""But there are a very small number of cases where usability issues go away because you have broadband.""\n\nDr Nielsen said the success of sites such as Google, Amazon, eBay and Yahoo showed that close attention to design and user needs was important. ""Those four sites are extremely profitable and extremely successful,"" said Dr Nielsen, adding that they have largely defined commercial success on the net.\n\n""All are based on user empowerment and make it easy for people to do things on the internet,"" he said. 
""They are making simple but powerful tools available to the user. ""None of them have a fancy or glamorous look,"" he added, declaring himself surprised that these sites have not been more widely copied. In the future, Dr Nielsen believes that search engines will play an even bigger part in helping people get to grips with the huge amount of information online. ""They are becoming like the operating system to the internet,"" he said. But, he said, the fact that they are useful now does not meant that they could not do better. Currently, he said, search sites did not do a very good job of describing the information that they return in response to queries. Often people had to look at a website just to judge whether it was useful or not. Tools that watch the behaviour of people on websites to see what they actually find useful could also help refine results. Research by Dr Nielsen shows that people are getting more sophisticated in their use of search engines. The latest statistics on how many words people use on search engines shows that, on average, they use 2.2 terms. In 1994 only 1.3 words were used. ""I think it's amazing that we have seen a doubling in a 10-year period of those search terms,"" said Dr Nielsen.\n\nYou can hear more from Jakob Nielsen and web design on the BBC World Service programme, Go Digital",tech
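
The sampled rows show two things worth normalizing before tokenization: HTML character references such as `F&#244;n` and paragraph breaks embedded in the text. The following is a minimal sketch of one possible cleaning step, assuming each article is a plain string; the function name `clean_article` is hypothetical and this is not necessarily the cleaning applied later in the notebook.

```python
import html
import re

def clean_article(raw):
    """Decode HTML character references and collapse all whitespace runs."""
    text = html.unescape(raw)          # e.g. "F&#244;n" becomes "Fôn"
    text = re.sub(r"\s+", " ", text)   # newlines and repeated spaces become one space
    return text.strip()

# Fragment taken from the sampled politics article above
example = "Bryn F&#244;n was released without charge .\n\nBut now"
cleaned = clean_article(example)
print(cleaned)
```

Applied to the whole dataset, this could be written as `df['text'].apply(clean_article)`.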
