# The articles are scraped from washingtonpost.com. Four sections of the site are used (politics, style, climate, tech)

**Categories**
1.  politics
2.  style
3.  climate
4.  tech
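
The scraping cells store each section as an integer in the `Category` column (1 for politics, 2 for style, 3 for climate, 4 for technology). A small lookup like the one below can make later outputs easier to read. This is only a sketch: the names `CATEGORY_NAMES` and `category_name` are hypothetical and do not appear elsewhere in the notebook.

```python
# Hypothetical lookup from the integer labels used in the "Category" column
# to the section names listed above.
CATEGORY_NAMES = {1: "politics", 2: "style", 3: "climate", 4: "tech"}


def category_name(label):
    """Return the section name for a numeric label, or "unknown" if unmapped."""
    return CATEGORY_NAMES.get(int(label), "unknown")


print(category_name(3))
```

For example, predicted labels could be passed through `category_name` before printing them.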

* BeautifulSoup could be used to scrape the articles, but it needs hand-written selectors matched to each site's HTML structure, so this approach would not carry over to most other websites.
* I have used the Python library Newspaper3k instead. It extracts an article's title, publish date, and keywords without site-specific selectors.

### The 4 sections are scraped separately
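
The scraping cells that follow repeat the same loop with a different URL and category number. Below is a minimal sketch of how that loop could be factored into one helper. It assumes only that each article object exposes `download()`, `parse()`, `nlp()`, `title`, `publish_date`, `keywords` and `url`, as used in those cells. The function name `articles_to_rows` and its `limit` parameter are hypothetical.

```python
def articles_to_rows(articles, category, limit=200):
    """Turn article objects into row dicts, skipping articles that fail.

    `articles` is any iterable of objects with the Newspaper3k-style
    interface used in this notebook (an assumption of this sketch).
    """
    rows = []
    for article in list(articles)[:limit]:
        try:
            article.download()
            article.parse()
            article.nlp()
        except Exception as exc:
            # One failed download or parse should not stop the whole run
            print(f"Skipping {getattr(article, 'url', '?')}: {exc}")
            continue
        rows.append({
            "Title": article.title,
            "Category": category,
            "Date": article.publish_date,
            "Keywords": " ".join(article.keywords),
            "URL": article.url,
        })
    return rows
```

With Newspaper3k this would be called as `articles_to_rows(news_paper.articles, category=1)`, and the returned list passed to `pd.DataFrame`.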

In [None]:
!pip3 install newspaper3k

In [None]:
import nltk
nltk.download('punkt')

### Politics Data Scraping
---



In [None]:
import newspaper
import pandas as pd
from tqdm import tqdm

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
news_paper = newspaper.build("https://www.washingtonpost.com/politics/", memoize_articles=False, user_agent=user_agent)



l = []
for article in tqdm(news_paper.articles[:200], desc="Processing articles"):
    article.download()
    article.parse()
    article.nlp()
    keywords = ' '.join(article.keywords)
    l.append({"Title": article.title, "Category": 1, "Date": article.publish_date, "Keywords": keywords, "URL": article.url})

df = pd.DataFrame(l)
df.to_csv("politics.csv")

### Style Data Scraping
---



In [None]:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
news_paper = newspaper.build("https://www.washingtonpost.com/style/", memoize_articles=False, user_agent=user_agent)



l = []
for article in tqdm(news_paper.articles[:200], desc="Processing articles"):
    article.download()
    article.parse()
    article.nlp()
    keywords = ' '.join(article.keywords)
    l.append({"Title": article.title, "Category": 2, "Date": article.publish_date, "Keywords": keywords, "URL": article.url})

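df = pd.DataFrame(l)
# Note: this file holds the style articles; the name "tech.csv" is kept because the preprocessing step reads it
df.to_csv("tech.csv")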

### Climate Data Scraping
---



In [None]:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
news_paper = newspaper.build("https://www.washingtonpost.com/climate-environment/", memoize_articles=False, user_agent=user_agent)



l = []
for article in tqdm(news_paper.articles[:200], desc="Processing articles"):
    article.download()
    article.parse()
    article.nlp()
    keywords = ' '.join(article.keywords)
    l.append({"Title": article.title, "Category": 3, "Date": article.publish_date, "Keywords": keywords, "URL": article.url})

df = pd.DataFrame(l)
df.to_csv("climate.csv")

### Technology Data Scraping
---



In [None]:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
news_paper = newspaper.build("https://www.washingtonpost.com/business/technology/", memoize_articles=False, user_agent=user_agent)



l = []
for article in tqdm(news_paper.articles[:200], desc="Processing articles"):
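    try:
        article.download()
        article.parse()
        article.nlp()
        keywords = ' '.join(article.keywords)
        l.append({"Title": article.title, "Category": 4, "Date": article.publish_date, "Keywords": keywords, "URL": article.url})
    except Exception as exc:
        # Skip articles that fail to download or parse, but report them instead of hiding the error
        print(f"Skipping {article.url}: {exc}")
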
df = pd.DataFrame(l)
df.to_csv("technology.csv")

### Data Preprocessing
---

In this step we concatenate the 4 dataframes created while scraping the 4 sections.
Duplicate rows are also removed.
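
The toy example below sketches what the concatenation and duplicate removal do. The rows are made up for illustration. It also shows one thing to watch for: `drop_duplicates()` with no arguments only removes rows that match in every column, so an article listed under two sections stays in the data with two different labels. Deduplicating on `Title` alone is shown as an alternative; it is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd

# Made-up rows standing in for two scraped sections
politics = pd.DataFrame([
    {"Title": "A", "Category": 1, "Keywords": "x"},
    {"Title": "B", "Category": 1, "Keywords": "y"},
])
climate = pd.DataFrame([
    {"Title": "B", "Category": 3, "Keywords": "y"},  # same article, other section
    {"Title": "C", "Category": 3, "Keywords": "z"},
    {"Title": "C", "Category": 3, "Keywords": "z"},  # exact duplicate row
])

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
combined = pd.concat([politics, climate], ignore_index=True)

# Removes only rows that are identical in every column
exact = combined.drop_duplicates()

# Alternative (assumption): keep the first row seen for each title
by_title = combined.drop_duplicates(subset=["Title"])

print(len(combined), len(exact), len(by_title))
```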

In [None]:
file1_path = '/kaggle/working/politics.csv'
file2_path = '/kaggle/working/tech.csv'  # holds the style articles (scraped from /style/)
file3_path = '/kaggle/working/climate.csv'
file4_path = '/kaggle/working/technology.csv'

df1 = pd.read_csv(file1_path)
df2 = pd.read_csv(file2_path)
df3 = pd.read_csv(file3_path)
df4 = pd.read_csv(file4_path)

combined_df = pd.concat([df1, df2, df3, df4], ignore_index=True)
combined_df.to_csv('combined_file_lite.csv', index=False)


In [None]:
df = pd.read_csv("combined_file_lite.csv")
df = df[["Title", "Category", "Keywords"]]
df.tail()

In [None]:
df = df.drop_duplicates()
df.tail()

In [None]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("combined_file.csv", index=False)

## Building ML Model

---
* After testing this data set with a large model such as BERT, I found that I would not be able to complete training with the compute available, because BERT is a heavy model.
* So I decided to use Logistic Regression instead. For this:
1. The data was tokenized and lemmatized.
2. Stop words and punctuation were removed.
3. The data was then embedded with the all-MiniLM-L6-v2 model from the sentence-transformers library.

In [None]:
!pip install -U sentence-transformers -q

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
data = pd.read_csv("/kaggle/working/combined_file.csv" ,engine="python")
data.tail()

In [None]:
import spacy
import string
nlp = spacy.load("en_core_web_sm")
stop_words = nlp.Defaults.stop_words

In [None]:
punctuations = string.punctuation
print(punctuations)

In [None]:
def spacy_tokenizer(sentence):
    doc = nlp(sentence)
    mytokens = [ word.lemma_.lower().strip() for word in doc ]
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    sentence = " ".join(mytokens)
    return sentence
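
One detail about the punctuation filter: `string.punctuation` is a plain `str`, so `token in string.punctuation` is a substring test, not a per-character membership test. Empty strings and some two-character sequences pass it, while others do not. The sketch below shows the difference and one possible explicit filter. The helper name `keep_token` is hypothetical and it is not the tokenizer used in this notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)

# Substring semantics of `in` on a str:
print("" in string.punctuation)    # True: the empty string is in every string
print("<=" in string.punctuation)  # True: "<" and "=" happen to be adjacent
print("!?" in string.punctuation)  # False: "!" is followed by '"', not "?"


def keep_token(token, stop_words=frozenset()):
    """Return True if the token should stay in the cleaned text."""
    if not token:  # empty after strip()
        return False
    if token in stop_words:
        return False
    # Drop tokens made up only of punctuation characters
    return not all(ch in PUNCT_CHARS for ch in token)
```

With this filter, `"!?"` is dropped even though it is not a substring of `string.punctuation`, and ordinary words are kept.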

In [None]:
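# Empty keyword strings are read back from CSV as NaN; replace them so spaCy always receives a string
data['tokenize'] = data['Keywords'].fillna("").apply(spacy_tokenizer)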

In [None]:
data.head()

In [None]:
data['embeddings'] = data['tokenize'].apply(model.encode)

In [None]:
X = data['embeddings'].to_list()
y = data['Category'].to_list()

In [None]:
data.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=y)
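
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in the train and test sets. A quick way to check this is to compare the share of each label in both lists. The helper below is a sketch with a hypothetical name (`label_shares`), and the example labels are made up.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of samples per label, rounded to 3 decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(count / total, 3) for label, count in sorted(counts.items())}


# Made-up labels for illustration
print(label_shares([1, 1, 2, 3, 3, 3, 4, 4]))
```

Calling `label_shares(y_train)` and `label_shares(y_test)` after the split should give similar fractions when the split is stratified.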

### Model Evaluation
---

In [None]:
import os
import warnings

os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

In [None]:
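from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Re-split with a fixed seed for reproducibility; stratify=y keeps every category represented in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Named log_reg so it does not overwrite the SentenceTransformer stored in `model`
log_reg = LogisticRegression()

C_values = [0.001, 0.01, 0.1, 1, 10, 100]
max_iters = [50, 100, 200, 500, 1000]
# Only supported penalty/solver pairs are listed; unsupported pairs (e.g. 'l1' with 'lbfgs') fail at fit time
param_grid = [
    {'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'C': C_values, 'max_iter': max_iters},
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], 'C': C_values, 'max_iter': max_iters},
    # 'elasticnet' is only supported by 'saga' and needs l1_ratio; 0.5 is an assumed value, not part of the original grid
    {'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5], 'C': C_values, 'max_iter': max_iters},
]

grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1)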
grid_search.fit(X_train, y_train)

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_) 
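
scikit-learn's `LogisticRegression` supports only certain penalty/solver pairs: `newton-cg`, `lbfgs` and `sag` accept `l2`, `liblinear` accepts `l1` and `l2`, and `saga` accepts `l1`, `l2` and `elasticnet` (which also needs `l1_ratio`). The sketch below counts how many combinations of the penalty, C, solver and max_iter values used in this search are valid when they are taken as a full Cartesian product. The `SUPPORTED` table is written out by hand here and is not read from the library.

```python
from itertools import product

# Penalty/solver pairs supported by LogisticRegression (hand-written table)
SUPPORTED = {
    "newton-cg": {"l2"},
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}

penalties = ["l1", "l2", "elasticnet"]
c_grid = [0.001, 0.01, 0.1, 1, 10, 100]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
iter_grid = [50, 100, 200, 500, 1000]

all_combos = list(product(penalties, c_grid, solvers, iter_grid))
# Keep a combination only if its solver supports its penalty
valid_combos = [combo for combo in all_combos if combo[0] in SUPPORTED[combo[2]]]

print(len(all_combos), len(valid_combos))  # 450 240
```

With `cv=5`, that is 2250 fits for the full product against 1200 for the valid combinations only.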

In [None]:
# Evaluate the best estimator from the grid search on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Compute overall accuracy and the per-class precision/recall report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

In [None]:
import joblib
best_model = grid_search.best_estimator_
joblib.dump(best_model, 'best_logistic_regression_model.joblib')

In [None]:
results_df = pd.DataFrame(grid_search.cv_results_)
results_df.to_csv('grid_search_results.csv', index=False)