Django REST API backend for a web app that analyses the positivity of a news site. Links to...
The back-end is a Django RESTful API hosted on Heroku's free tier. It's completely separate from the front-end and the codebase sits in this repo.
There's only one view in this Django project, and it corresponds to the API endpoint that connects to the front-end. It receives the URL of the news site and returns the analysis data. I leverage Django REST framework to set up this view using the generic APIView class.
Two models store all the site submissions together with the data generated from my analysis:
- WebsiteModel: sites that were successfully analysed
- FailedWebsiteModel: sites where the back-end couldn't successfully send analysis results back to the front-end
This enables me to see all the data in my Django admin panel and discover gems 😄
The scoring process starts by fetching the URL that has been sent to the API. Once the page is retrieved successfully, it uses Beautiful Soup to extract all the text and split it into a list according to the HTML elements the text belongs to.
I remove any text piece that meets any of the following criteria:
- It includes site-generic terms such as "cookie" or "sign up".
- It doesn't have enough words; the sentiment models struggle to analyse short snippets.
- It is a duplicate of another piece.
I also apply some encoding/decoding black magic using the text_transform function to remove odd characters.
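The extraction and filtering steps above could be sketched as follows. The blocklist, word-count threshold, and function names are assumptions; the real pipeline may differ:

```python
import requests
from bs4 import BeautifulSoup

GENERIC_TERMS = ("cookie", "sign up")  # assumed site-generic blocklist
MIN_WORDS = 4                          # assumed minimum word count


def extract_text_pieces(html):
    """Split a page's text into a filtered list, one piece per HTML element."""
    soup = BeautifulSoup(html, "html.parser")
    seen, kept = set(), []
    for piece in soup.stripped_strings:
        lowered = piece.lower()
        if any(term in lowered for term in GENERIC_TERMS):
            continue  # site-generic boilerplate
        if len(piece.split()) < MIN_WORDS:
            continue  # too short for the sentiment models
        if lowered in seen:
            continue  # duplicate
        seen.add(lowered)
        kept.append(piece)
    return kept


def fetch_text_pieces(url):
    """Download a page and return its filtered text pieces."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return extract_text_pieces(resp.text)
```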
I selected the VADER and AFINN sentiment analysis libraries to generate my site positivity scores. Both are well known in the NLP space. I also tested them on a few samples; they seem to be quite reliable and complementary to each other.
I go through each piece of text and compare the scores generated by the two libraries. The aggregate score depends on several situations:
- Both scores have the same sign -> aggregate score is +/-1, matching that sign.
- One score is 0 while the other is non-0 -> aggregate score is +/-1, depending on the sign of the non-0 score.
- The two scores have opposite signs -> aggregate score is 0, because the models seem unreliable there.
- Both scores are 0 -> aggregate score is again 0.
You'll notice I convert all scores to +/-1 and 0; using the actual magnitudes didn't produce reliable results for me.
Finally, I calculate the entire site's absolute score by dividing the number of (+1) text pieces by the number of non-0 text pieces. But we're not done yet :)
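The sign-based aggregation and the site-level ratio can be sketched as below. This is a minimal illustration assuming each piece's VADER and AFINN scores are already computed; it is not the repo's actual code:

```python
def aggregate(vader_score, afinn_score):
    """Collapse two sentiment scores into +1, -1, or 0 using signs only."""
    def sign(x):
        return (x > 0) - (x < 0)

    s1, s2 = sign(vader_score), sign(afinn_score)
    if s1 == s2:
        return s1           # same sign -> that sign; both zero -> 0
    if s1 == 0 or s2 == 0:
        return s1 or s2     # one zero -> follow the non-zero score
    return 0                # opposite signs -> the models disagree


def site_score(aggregates):
    """Share of +1 pieces among all non-zero pieces."""
    nonzero = [a for a in aggregates if a != 0]
    if not nonzero:
        return 0.0
    return nonzero.count(1) / len(nonzero)
```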
I retrieve all the absolute scores stored in the WebsiteModel and add the site-in-focus's score to the list. I then use pandas to group the scores by URL and take the mean for each one.
With scikit-learn, I scale all the scores so that they fall between 0 and 1: 0 represents the least positive score in the pandas dataframe, while 1 represents the most positive score.
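The grouping and scaling steps might look like this. The function name and the record format are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_scores(records):
    """records: (url, absolute_score) tuples, including the site in focus."""
    df = pd.DataFrame(records, columns=["url", "score"])
    # One mean score per URL, in case a site was submitted more than once.
    means = df.groupby("url", as_index=False)["score"].mean()
    # Rescale so the least positive site maps to 0 and the most positive to 1.
    means["scaled"] = MinMaxScaler().fit_transform(means[["score"]])
    return means
```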
This scaled dataset is what's sent back to the front-end for the wonderful results page. The end 🥳
Install the Python packages
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
You might encounter issues installing psycopg2 on Linux (Ubuntu). Run the commands below, as per the installation guide:
sudo apt-get install python3-dev libpq-dev build-essential
export PATH=/usr/lib/postgresql/X.Y/bin/:$PATH
Create a PostgreSQL db.
Create a .env file that stores the following...
DJANGO_ENV=development
SECRET_KEY=create_your_key
DB={"ENGINE": "django.db.backends.postgresql", "NAME": "your_db_name", "USER": "your_db_username", "PASSWORD": "your_db_password!", "HOST": "localhost", "PORT": "5432"}
Set up your database tables
python manage.py migrate
Run the app
python manage.py runserver
Test the app
python manage.py test
I set up a workflow that...:
- Creates a PostgreSQL database service
- Tests the single API endpoint with a good request
- Tests the API with a bad request
heroku login
git push heroku master
heroku ps:scale web=1
Add the environment variables as per the local setup.
Licensed under Mozilla Public License 2.0.