manghat/python-remove-html-from-csv

Removes HTML tags from a column in a .csv file

About

The Python script cleans the chosen column in 2 different ways and returns a file with 4 additional columns (2 cleaned columns and 2 parity columns):

  1. Regex matching: anything between "<" and ">" is removed, as is anything between "&" and ";" with 4 or 5 characters in between, and "\*" is replaced with a whitespace character. Note: special characters written this way (e.g. &nbsp; &rpos;) are simply removed, not converted.
  2. BeautifulSoup HTML-to-text conversion: this removes HTML tags and converts special characters into the characters they represent.
  3. 2 parity columns, each giving the difference in the number of characters between a newly generated column and the original column. (This is a flag you can check to see whether too many characters have been removed.)
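
The regex pass and the parity count described above can be sketched as follows. This is a minimal illustration, not the code in remove_html.py: the two patterns, the function names `clean_with_regex` and `parity`, and the sign of the parity value (original length minus cleaned length) are all assumptions reconstructed from the description.

```python
import re

# Assumed patterns, reconstructed from the description; the actual script's
# expressions may differ.
TAG_RE = re.compile(r"<[^>]*>")            # "<", anything up to the next ">", ">"
ENTITY_RE = re.compile(r"&[^;\s]{4,5};")   # "&", 4 or 5 characters, ";"


def clean_with_regex(text: str) -> str:
    """Remove tags and 4-5 character entities, and turn "*" into a space."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    return text.replace("*", " ")


def parity(original: str, cleaned: str) -> int:
    """Number of characters lost by cleaning (assumed: original minus cleaned)."""
    return len(original) - len(cleaned)


sample = "<p>Hello&nbsp;<b>world</b>*</p>"
cleaned = clean_with_regex(sample)
print(repr(cleaned))            # 'Helloworld '
print(parity(sample, cleaned))  # 20
```

Because the entity is removed rather than converted, "Hello" and "world" end up joined. Under the 4-or-5-character rule, a shorter entity such as &amp; would be left untouched by this sketch.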

How to use

You need to install these modules:

  • pandas
  • bs4
  • lxml

Example: python -m pip install bs4 lxml pandas

  1. Place remove_html.py in the same directory as the CSV file.
  2. Open a terminal at the file location. On Windows: press Win+R, type cmd, press Enter, then run cd <path to file>.
  3. Type python remove_html.py and press Enter.
  4. Follow the instructions.
  5. You are done.
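
For readers who want a rough idea of what happens to the CSV once the steps above are done, here is a minimal end-to-end sketch using pandas and BeautifulSoup. It is not the actual remove_html.py: the function names, the generated column names (`<column>_regex`, `<column>_bs4` and the two `_parity` columns) and the parity sign are hypothetical choices made for illustration.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


def clean_with_regex(text: str) -> str:
    """Regex pass (assumed patterns): drop tags and 4-5 character entities."""
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"&[^;\s]{4,5};", "", text)
    return text.replace("*", " ")


def clean_with_bs4(text: str) -> str:
    """BeautifulSoup pass: strip tags and decode entities using the lxml parser."""
    return BeautifulSoup(text, "lxml").get_text()


def add_clean_columns(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with 2 cleaned columns and 2 parity columns added."""
    out = df.copy()
    # Treat missing cells as empty strings so both cleaners always get text.
    source = out[column].fillna("").astype(str)
    out[f"{column}_regex"] = source.map(clean_with_regex)
    out[f"{column}_bs4"] = source.map(clean_with_bs4)
    # Parity (assumed sign): characters lost relative to the original cell.
    out[f"{column}_regex_parity"] = source.str.len() - out[f"{column}_regex"].str.len()
    out[f"{column}_bs4_parity"] = source.str.len() - out[f"{column}_bs4"].str.len()
    return out
```

A typical call, with hypothetical file and column names, would be `add_clean_columns(pd.read_csv("input.csv"), "description").to_csv("output.csv", index=False)`. Rows with a large parity value are the ones worth checking by hand.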

Future plans

  1. Auto-detect file type
  2. Multi-column support
