
Fake News Prediction System


Authors

Table of Contents

Business Problem

This app predicts whether a news article contains misinformation, which can mislead the public and stir social conflict. In the era of technology and social media, fake news has become a growing problem, and it is increasingly difficult to judge the validity of what we read. Fake news can manipulate people's perception of reality, influence politics, and promote false advertising. The app reports the probability that an article contains fake news. It is intended for anyone who questions an article they are reading and wants to investigate further before taking it at face value.

Data Source

Methods

  • Exploratory Data Analysis
  • Text Cleaning using Regex
  • Term Frequency-Inverse Document Frequency
  • Modeling
  • Deployment

Tech Stack

  • Python (Refer to requirements.txt for the packages used in this project)
  • Streamlit (Interface for model)
  • Google Drive (Data Storage)

Natural Language Processing

We are given a data frame of article observations, but machine learning models only understand numbers, so we must convert the text into a mathematical representation. In other words, we must tokenize the text so the model can extract meaning or patterns for our use case. Since the goal is to determine whether an article contains fake news, we tokenize words rather than sentences: particular words can make an article stand out as misinformation. I chose the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, which captures the importance of a word in a document relative to a corpus of documents. Before applying the vectorizer, we clean the text with regex, removing special characters and words that provide no signal for this prediction task. Here is a step-by-step explanation of how TF-IDF vectorization works (a code sketch follows the list):

  1. Term Frequency (TF): TF measures the importance of a term within a single document. It is the count of the term in the document divided by the total number of terms in that document. The assumption is that the more frequently a term appears in a document, the more important it is to that document.
  2. Inverse Document Frequency (IDF): IDF measures the rarity or uniqueness of a term across the entire corpus. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term. IDF assigns higher weights to terms that are less common in the corpus, as they are considered more informative or distinctive.
  3. TF-IDF Calculation: The TF-IDF score for a term in a document is the TF value multiplied by the IDF value for that term. Repeating this for every term in every document yields a numerical representation of each document based on the importance of its terms relative to the corpus.
  4. Vectorization: After calculating the TF-IDF scores, the vectorizer converts each document into a vector, where each component is the TF-IDF score of a specific vocabulary term in that document.
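The project's exact cleaning rules and vectorizer settings live in the notebook; the snippet below is only a minimal sketch of the pipeline described above, using scikit-learn's TfidfVectorizer on a toy corpus, with illustrative regex rules chosen for this example.

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer

    def clean_text(text):
        """Lowercase and strip characters that carry no signal for this task."""
        text = text.lower()
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
        text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and special characters
        return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

    docs = [
        "BREAKING: Scientists CONFIRM shocking miracle cure!!!",
        "The city council approved the annual budget on Tuesday.",
    ]
    cleaned = [clean_text(d) for d in docs]

    # Tokenize into words and weight each one by tf * idf. Note that
    # scikit-learn smooths the idf term and L2-normalizes each row.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(cleaned)  # sparse matrix: documents x vocabulary
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))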

Quick Glance at the Results

Confusion Matrix of Decision Tree Classifier.

ROC curve of Decision Tree Classifier.

Top 3 models on the testing set (with default parameters)

Model                 Accuracy
Logistic Regression   98%
Decision Tree         99%
Gradient Boosting     99%

  • Final Model used: Decision Tree Classifier
  • Why choose Decision Tree Classifier compared to the other models: I chose the Decision Tree over Gradient Boosting because it is the more efficient model. Gradient Boosting was far more expensive computationally, since it effectively trains many decision trees behind the scenes, yet its accuracy was about the same as the single decision tree's. A good argument could be made for the Logistic Regression model, since we could tune our decision threshold and sacrifice a single percent of accuracy; it is an adequate model for this situation and provides more flexibility, but that flexibility was not necessary here. (A sketch of this comparison appears after this list.)
  • Metric used: Accuracy
  • Why choose Accuracy as a metric: The goal of this project is to correctly identify fake news articles, but it is important to keep in mind the consequences of labeling an article fake when in actuality it was real. That argues for a high specificity rate, yet we also do not want recall to suffer, so we want an even balance of both, which a good accuracy score captures. It is also worth noting that when the target variable has imbalanced classes (more real articles than fake ones, in our case), the F1-score is sometimes preferred, since class imbalance can distort accuracy. However, in the modeling phase our models performed well even on unseen data, which is why I chose accuracy over F1-score.
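As a rough sketch of that comparison (not the notebook's exact code), the snippet below fits the three candidates with default parameters and reports test accuracy. Here X stands for the TF-IDF feature matrix from the earlier sketch and y for the real/fake labels; both are placeholders for the notebook's actual data.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # X (TF-IDF features) and y (labels) are placeholders for the
    # notebook's actual data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: {accuracy:.2%}")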

Lessons Learned and Recommendation

  • This project involved natural language processing, and through the process I learned the importance of cleaning the text before the modeling phase. Once I cleaned the text using regex, my accuracies improved, and the model was better equipped to learn the patterns across documents. Every piece of text you encounter will require different cleaning steps, so it is important to skim through various documents to see what steps must be taken to build the most effective model.
  • Another crucial step in natural language processing is picking an adequate algorithm to tokenize the words in a document. The choice should align with the goal of your project, and the appropriate algorithm will tokenize the words in a way that serves that goal. Some of the most frequently used vectorizers include Count Vectorizer, TF-IDF Vectorizer, Word2Vec, and BERT, all of which map words onto a multi-dimensional space in different ways, hence the name word vectorizer. This is where a solid understanding of linear algebra helps in making sense of each algorithm. (A small comparison of two of these vectorizers follows.)
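To make the difference between vectorizers concrete, here is a small toy example (not taken from the project) contrasting raw counts with TF-IDF weights: a word like "news" that appears in every document keeps its raw count, but TF-IDF weights it lower than rarer, more distinctive words.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "fake news spreads fast",
        "real news reports facts",
        "news outlets verify facts",
    ]

    # Raw counts treat every word the same, however common it is.
    counts = CountVectorizer().fit_transform(corpus)

    # TF-IDF down-weights "news" (it appears in all three documents)
    # relative to rarer words such as "fake" or "verify".
    tfidf = TfidfVectorizer().fit_transform(corpus)
    print(counts.toarray())
    print(tfidf.toarray().round(2))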

Limitation and what can be Improved

  • The first big limitation of this project is that our data merely scratches the surface of all the articles available on the internet. To build a model that can reliably detect false news articles, we would need to scrape far more of the web and train on that. That is an enormous amount of data, but I think it will be necessary for the good of our future. Advances in artificial intelligence are making misinformation harder to detect, and it now extends beyond news articles to fake audio and video. We need systems that can separate the real from the fake to preserve our perception of reality.
  • Next time, it might be better to use the Logistic Regression model and optimize different metrics for our specific goals. For example, we might accept misidentifying some real news articles in exchange for catching more misinformation. Depending on the purpose, this is worth investigating further.
  • The text cleaning phase is another step we could improve or investigate further with additional adjustments.
  • Another major limitation is language: our model was trained only on English news articles and may not handle any other language. This underlines how much training data and compute power a large-scale model would require.

Run Locally

First, open your command line or terminal and navigate to the directory where you want to save the project.

Clone the Project

    git clone https://github.com/luisosorio3214/Fake-News-Prediction-System.git

Head to the project directory

    cd Fake-News-Prediction-System

Create a virtual environment using venv

    python -m venv env_name

Activate the virtual environment

For Windows users:

    env_name\Scripts\activate

For Mac users:

    source env_name/bin/activate

Install the required dependencies from the requirements.txt file

    pip install -r requirements.txt

Start the Streamlit server locally

    streamlit run app.py

If you are having issues with Streamlit, please follow this tutorial on how to set it up.

Explore the notebook

To explore the notebook file click here.

Deployment on Streamlit

To deploy this project on Streamlit Share, follow these steps:

  1. Make sure you have a GitHub repository with the full project files, including the requirements.txt file
  2. Go to Streamlit Share
  3. Log in with GitHub, Google, etc.
  4. Click on the New app button
  5. Select the GitHub repo, branch, and the Python file containing the Streamlit code
  6. Click Save and Deploy

App deployed on Streamlit

Video to gif tool

Contribution

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change or contribute.

License

MIT License

Copyright (c) 2022 Stern Semasuka

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Learn more about MIT license