Bring your pdf documents to life with a customizable WordCloud.

gstaxy/pdf2wordcloud

PDF 2 WordCloud

Bird migration survey protocol - Government of Alberta, Canada

Description

Bring your PDF documents to life using wordclouds to represent keywords and important topics in the text. Though sometimes overlooked, wordclouds are a useful way to summarize information quickly. This repository also makes it easy to add a mask (shape) to the wordcloud.

Getting started

To use or develop the project locally, fork this repository to your GitHub account (optional), then clone it to your computer. To clone the master branch, open a console, navigate to the directory where the project should live, and run:

> git clone -b master https://github.com/gstaxy/pdf2wordcloud.git

Setup environment

This app was built in a specific environment configuration, and reproducing it helps keep all functionality working. To get familiar with virtual environments, please read the tutorial I wrote on the subject for Windows and Linux users. From the console (PowerShell here), run the following lines:

# If not installed already
> pip install virtualenv
> virtualenv --version
virtualenv 20.0.18

# Create the virtual environment
> virtualenv venv --python=python3.7.6

# Activate the virtual environment
> venv/Scripts/activate.ps1

# Install the library requirements
(venv)> pip install -r requirements.txt

Now, the environment is ready to generate wordclouds!

Generate the WordCloud

  1. Drop the PDF document in the folder pdf_files/.
  2. In config.py, replace the pdf_filename value of FILENAME with the name of the document to use.
  3. Customize the wordcloud look and content from config.py. More details in the Customization section.
  4. From the root directory, run this line in the console to generate the wordcloud.
(venv)> py main.py
  5. All the processing steps are described in the console, and the image appears in a separate window once it's ready. At the same time, the wordcloud is saved in the saved_wc/ folder.

Customization

Most of the wordcloud settings are located in config.py and are loaded directly from there when running main().
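As a rough illustration, a config.py using the setting names mentioned in this README might look like the sketch below. The names come from the sections that follow. Every value is a placeholder assumption, not the repository's actual default, and the real file may contain other settings.

```python
# Hypothetical config.py sketch. Setting names are taken from this README;
# all values are illustrative assumptions, not the project's defaults.
FILENAME = "my_report.pdf"       # PDF dropped in pdf_files/ (assumed example name)
FIG_WIDTH = 1584                 # image width in pixels (assumed value)
FIG_HEIGHT = 396                 # image height in pixels (assumed value)
BG_COLOR = "white"               # background color
WORDS_COLORMAP = "viridis"       # name of a matplotlib colormap
NUM_OF_WORDS = 200               # maximum number of words drawn (assumed value)
IMAGE_LINK = "https://example.com/shape.png"  # mask image URL ending with .png
```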

Stopwords

The most common stopwords are already filtered with the nltk library in the text cleaning step. To add custom stopwords, copy the examples/stopwords.txt starter file into the root directory and customize it.
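To make the idea concrete, here is a minimal sketch of how a custom stopwords file could be merged with a base list and applied to tokens. The function names and the assumed file layout (whitespace-separated words, one or more per line) are hypothetical; the project's own cleaning code in lib/ may work differently.

```python
from pathlib import Path


def load_custom_stopwords(path, base_stopwords=()):
    """Return the base stopwords plus the words listed in a text file.

    Assumes whitespace-separated words, one or more per line. This layout
    is an assumption; the real stopwords.txt may be organized differently.
    """
    words = {w.lower() for w in base_stopwords}
    file = Path(path)
    if file.exists():
        for line in file.read_text(encoding="utf-8").splitlines():
            words.update(token.lower() for token in line.split())
    return words


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]
```

In practice, the base list would be the nltk stopwords for the chosen language, and the file would hold document-specific words that add no meaning, such as a recurring header.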

Size of the image

The current size settings are specific to LinkedIn profile banners. To customize the image size, change the pixel values of FIG_HEIGHT and FIG_WIDTH in config.py. Some common image sizes used on social media can be found on this website.

Colors

The background color (BG_COLOR) and the text colormap (WORDS_COLORMAP) can both be changed in config.py. Available matplotlib colormaps can be found here.

Number of words

The number of words can be changed under NUM_OF_WORDS in config.py.
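A wordcloud sizes words by how often they occur, so a word limit effectively keeps the most frequent ones. The sketch below shows that idea with the standard library only. The function name top_words and the simple regex tokenizer are assumptions for illustration, not the project's implementation.

```python
import re
from collections import Counter


def top_words(text, num_of_words, stopwords=frozenset()):
    """Return the num_of_words most frequent words as (word, count) pairs."""
    # Simplified tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(num_of_words)
```

For example, top_words("bird bird migration", 1) keeps only the pair for "bird", since it is the most frequent word.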

Mask

An image outline can be added to the wordcloud to give it a specific shape. To do so, find a .png image and copy its URL into IMAGE_LINK in the config.py file. Make sure the URL ends with .png once it's copied. The black outline can be changed or removed by modifying arguments in lib/cloud.py.
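Since the mask URL has to end with .png, a quick check before running the app can catch a bad link early. The helper below is a hypothetical sketch, not part of the repository. It only inspects the URL string and does not download or validate the image itself.

```python
from urllib.parse import urlparse


def is_png_url(url):
    """Return True for an http(s) URL whose text ends with .png."""
    parsed = urlparse(url)
    has_web_scheme = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # The README asks for the whole URL to end with .png, so a trailing
    # query string such as "?raw=true" is rejected here.
    return has_web_scheme and url.lower().endswith(".png")
```

A link copied from an image search often carries extra parameters after the file name, which is the case this check is meant to flag.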

Language

The default language used to filter the text is English. To change it, modify line 28 of lib/cleaning.py to the desired language. The custom stopwords will also need to be changed accordingly.

Examples

Here are some reproducible examples. Source images are located in the examples/ folder.

Click on any wordcloud image to open the link to its source PDF.

Reports

Oil sand annual monitoring report - Government of Alberta, Canada

Books

1984 - by George Orwell

Pride and Prejudice - by Jane Austen

Robinson Crusoe - by Daniel Defoe

Resumes

Resume samples - Bellevue University


Future improvements

  • Add argparse to the main() function so its arguments can be set directly from the command line.
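One possible shape for this improvement is sketched below with the standard argparse module. The option names and default values are hypothetical and only mirror the settings described in the Customization section. This feature is not implemented in the repository yet.

```python
import argparse


def build_parser():
    """Build a command-line parser mirroring some config.py settings."""
    parser = argparse.ArgumentParser(
        description="Generate a wordcloud from a PDF document."
    )
    # All option names and defaults below are illustrative assumptions.
    parser.add_argument("filename", help="name of the PDF located in pdf_files/")
    parser.add_argument("--num-words", type=int, default=200,
                        help="maximum number of words in the wordcloud")
    parser.add_argument("--bg-color", default="white",
                        help="background color of the image")
    parser.add_argument("--colormap", default="viridis",
                        help="matplotlib colormap used for the words")
    parser.add_argument("--image-link", default=None,
                        help="URL of a .png image to use as a mask")
    return parser


# Parse an explicit argument list here so the sketch runs anywhere;
# a real main() would call parse_args() without arguments.
args = build_parser().parse_args(["report.pdf", "--num-words", "100"])
```

With a parser like this, values from the command line could override those in config.py, which would still supply the defaults.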

Contributions

Any contribution to the project is welcome and encouraged. To propose an addition or improvement, please open an issue or submit a pull request.

Some resources
