This library contains all the essential functions for data cleaning.
It takes a list of data cleaning operations and either a string or a pandas dataframe as input.
Functions (select operations by passing their numbers in lst):
1. Remove new lines
2. Remove emails
3. Remove URLs
4. Remove hashtags (#hashtag)
5. Remove the string if it contains only numbers
6. Remove mentions (@user)
7. Remove retweets (RT...)
8. Remove text between square brackets [ ]
9. Replace multiple whitespace characters with a single space
10. Replace characters repeated more than twice with a single occurrence
11. Remove emojis
12. Count characters (only for dataframe; creates a new column)
13. Count words (only for dataframe; creates a new column)
14. Calculate average word length (only for dataframe; creates a new column)
15. Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
16. Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
17. Detect language (uses fasttext-langdetect) (only for dataframe; creates just one column with language and probability; takes less time)
18. Remove HTML tags
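To make a few of the text operations above concrete, here is a minimal sketch using only Python's standard `re` module. The patterns and the helper name `strip_patterns` are illustrative assumptions; the library's own regular expressions may differ.

```python
import re

# Illustrative patterns only; the library's own regular expressions may differ.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
BRACKET_RE = re.compile(r"\[[^\]]*\]")
WHITESPACE_RE = re.compile(r"\s+")


def strip_patterns(text: str) -> str:
    """Remove emails, URLs, hashtags, mentions and bracketed text, then tidy spaces."""
    # Emails go first so the mention pattern does not cut "@example" out of them.
    for pattern in (EMAIL_RE, URL_RE, HASHTAG_RE, MENTION_RE, BRACKET_RE):
        text = pattern.sub(" ", text)
    # Collapse runs of whitespace (including new lines) into single spaces.
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "@user thanks. Mail abc@example.com or see https://example.com [ad] #promo\nBye"
print(strip_patterns(sample))
```

The order of the patterns matters in this sketch: running the mention pattern before the email pattern would leave fragments of the address behind.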
Installation:
pip install cleanmydata
Parameters:
- lst (list) - List of data cleaning operations
- data (string or dataframe) - Data to be passed
- column (string) - Dataframe column on which to perform the operations; only used when data is a dataframe
- save (boolean) - Whether to save the results to a new file
- name (string) - Name of the new file if save is True
Usage:
- Import the library
from cleanmydata.functions import *
- Call the method clean_data, and pass the parameters as you wish.
- By default, if a dataframe is passed, all NA values are dropped (dropna)
- To remove emails and hashtags
mydata = "Hello folks. abc@example.com #hashtag"
mydata = clean_data(lst=[2, 4], data=mydata)
print(mydata)
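The way a list of operation numbers selects cleaning steps can be pictured with a small dispatch table. This is a hypothetical sketch, not the library's implementation: `OPERATIONS` and `apply_operations` are made-up names, and only operations 2 (emails) and 4 (hashtags) from the example above are included.

```python
import re
from typing import Callable, Dict, List

# Hypothetical mapping from operation number to a cleaning step. Only the two
# operations used in the example above are sketched here.
OPERATIONS: Dict[int, Callable[[str], str]] = {
    2: lambda text: re.sub(r"\S+@\S+\.\S+", "", text),  # remove emails
    4: lambda text: re.sub(r"#\w+", "", text),          # remove hashtags
}


def apply_operations(lst: List[int], data: str) -> str:
    """Apply the selected operations in the order given, then trim the result."""
    for number in lst:
        data = OPERATIONS[number](data)
    return data.strip()


print(apply_operations([2, 4], "Hello folks. abc@example.com #hashtag"))
```

With these two patterns the sketch prints `Hello folks.`; the exact output of clean_data may differ in whitespace handling.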
- To count stopwords, remove mentions and URLs, and save the result to a file from a dataframe
import pandas as pd
df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
df = clean_data(lst=[15, 6, 3], data=df, column='comments', save=True, name='my custom file name')
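For the dataframe-only operations that create new columns, the following sketch shows how such counts can be computed with plain pandas. The toy data and the column names `char_count`, `word_count` and `avg_word_length` are assumptions made for illustration; the library may name its columns differently.

```python
import pandas as pd

# Toy data standing in for the 'comments' column of a real CSV file.
df = pd.DataFrame({"comments": ["good product", None, "would buy it again"]})

# Drop rows with missing values first, similar to the default dropna behaviour.
df = df.dropna().reset_index(drop=True)

# Column names below are illustrative assumptions, not the library's own.
df["char_count"] = df["comments"].str.len()
df["word_count"] = df["comments"].str.split().str.len()
df["avg_word_length"] = df["comments"].apply(
    lambda text: sum(len(word) for word in text.split()) / len(text.split())
)

print(df)
```

Dropping missing values before computing the counts keeps the lambda from failing on non-string entries.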
If using stopwords, make sure you have en_core_web_sm installed.
python -m spacy download en_core_web_sm
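Because spaCy models are installed as regular Python packages, a script can check for the model before requesting a stopword count. The helper `model_installed` below is a hypothetical convenience written for this sketch, not part of cleanmydata or spaCy.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if the spaCy model package can be found by the import system."""
    return importlib.util.find_spec(name) is not None


if not model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```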