Author : mevinbabuc@gmail.com
Description : Tokenizes chunks of text efficiently using a producer-consumer threading architecture and stores the word frequencies in a pickled dictionary (an end-to-end sketch appears after the notes below)
- Tokenizes words
- Removes stop words
- Stems each word using the NLTK Lancaster stemmer, e.g. playing => play, played => play
- Removes punctuation and irrelevant characters like ( !@#$%& )
- Removes more than 2 repetitions of a character in a word, e.g. heyyyy => heyy, yeaaaahh => yeaahh
- Removes links, #tags and @username references
- Removes numbers like 1, 12 but keeps words like g8, n8, m8 (a sketch of these cleaning steps follows this list)
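
A minimal sketch of those cleaning steps, assuming straightforward regular expressions; the patterns here are illustrative, not necessarily the exact ones Tokenizer.py uses:

    import re

    from nltk.stem import LancasterStemmer

    stemmer = LancasterStemmer()

    def clean(text):
        # Remove links, then #tags and @username references
        text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
        text = re.sub(r'[@#]\w+', ' ', text)
        # Remove punctuation and irrelevant characters like !@#$%&
        text = re.sub(r'[^\w\s]', ' ', text)
        # Collapse more than 2 repetitions of a character: heyyyy => heyy
        text = re.sub(r'(\w)\1{2,}', r'\1\1', text)
        # Remove standalone numbers (1, 12) but keep words like g8, n8, m8
        text = re.sub(r'\b\d+\b', ' ', text)
        # Tokenize and stem: playing => play, played => play
        return [stemmer.stem(w) for w in text.lower().split()]

Note that the order matters: links, #tags and @mentions are stripped before general punctuation, otherwise the '@' and '#' markers would already be gone and the references could not be recognised.
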
- The data to be processed is assumed to be saved in a file called data.txt.
- stopwords.p contains a basic list of stop words.
- A custom stop word list can be created by writing the words line by line in a stop.txt file and running the stop_pickle.py script.
- If you don't need to remove stop words, create an empty stopwords.p file and save it in the same directory as Tokenizer.py, or remove the appropriate code from Tokenizer.py.
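
Putting the pieces together, the producer-consumer side could look roughly like the sketch below: one producer thread reads data.txt line by line, worker threads clean each line and drop stop words, and the shared frequency counter is pickled at the end. The worker count, queue size and the output filename frequency.p are assumptions, not taken from Tokenizer.py:

    import pickle
    import queue
    import threading
    from collections import Counter

    NUM_WORKERS = 4            # assumed; tune to taste

    def clean(line):
        # stand-in for the regex/stemming helper sketched above
        return line.lower().split()

    # An empty stopwords.p means no stop-word removal
    try:
        with open('stopwords.p', 'rb') as f:
            stopwords = set(pickle.load(f))
    except EOFError:
        stopwords = set()

    lines = queue.Queue(maxsize=1000)
    freq = Counter()
    lock = threading.Lock()

    def producer():
        # Read data.txt line by line and hand each line to the consumers
        with open('data.txt') as f:
            for line in f:
                lines.put(line)
        for _ in range(NUM_WORKERS):
            lines.put(None)    # one poison pill per consumer

    def consumer():
        while True:
            line = lines.get()
            if line is None:
                break
            words = [w for w in clean(line) if w not in stopwords]
            with lock:         # Counter updates are not atomic
                freq.update(words)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Store the word-frequency dictionary, pickled
    with open('frequency.p', 'wb') as f:
        pickle.dump(dict(freq), f)
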
- data.txt
    - Write each data entry on its own line
- stop.txt
    - Write one word per line
    - Run the stop_pickle.py script to pickle the stop.txt file into a stopwords.p file (see the sketch below)
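
A stop_pickle.py along these lines would do that conversion; this is a sketch, and whether the pickle holds a list or a set is an assumption:

    import pickle

    # Read one stop word per line from stop.txt
    with open('stop.txt') as f:
        words = [line.strip() for line in f if line.strip()]

    # Pickle the word list to stopwords.p for Tokenizer.py to load
    with open('stopwords.p', 'wb') as f:
        pickle.dump(words, f)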