may-tal/nlpProject
The project contains the following files:

* main.py - Main project file.
* classifiers.py - Receives a tagged data set and splits it into train and test sets. Each classifier learns from the training set and predicts tags for the test set. The file contains several classifiers and returns the scores of each one.
* evaluation.py - Computes all measure scores for a given classifier and plots the ROC curve.
* clustring.py - Clusters the data into three classes using the k-means algorithm and plots a word cloud.
* data.py - Receives the path of the folder that contains the data files and returns the data in CSV form.
* feature_extraction.py - Transforms the raw data into features that better represent the underlying problem, which improves predictive model accuracy on unseen data (the feature engineering process).
* feature_selection.py - Receives the train data and selects features using three methods: removing features with low variance, selecting the best features based on univariate statistical tests, and selecting from a model.
* text_statistics.py - Contains functions that receive the preprocessed data and return statistics about it.
* text_normalization.py - Contains all the functions of the preprocessing step: cleaning and normalizing the data.
* topic_modeling.py - Applies several topic modeling methods: SVD and NMF.
* heb_stopwords.txt - Text file that contains the Hebrew stop words.
* Data/sentences.neg - Contains the data labelled as negative.
* Data/sentences.pos - Contains the data labelled as positive.
* sentiment_lexicon/negative_words_he.txt - Hebrew sentiment lexicon of negative words.
* sentiment_lexicon/positive_words_he.txt - Hebrew sentiment lexicon of positive words.
* orig_data.csv - Contains the original data as a dataframe; each line contains a text and its label.
* norm_data.csv - Contains the normalized data (clean + YAP) as a dataframe.
* yap_punc_data.csv - Contains the data after punctuation removal and YAP as a dataframe.
* clean_data.csv - Contains the clean data as a dataframe.
* norm_no_yap_data.csv - Contains the normalized data without YAP as a dataframe.
* words_vector.npy - The list of representing vectors (arranged according to word_list.txt).
* word_list.txt - The list of words that have representative word2vec vectors.

---- How to run our project? ----

To run this code, download twitter-w2w from https://drive.google.com/drive/folders/1b1Pj1oWBqs3y0Qncaqpz4IK-ujzChy2Z and place the two files word_list.txt and words_vector.npy in the project folder. Then install the relevant packages and run the 'main.py' file.
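Before running main.py, it can help to check that the two downloaded files are in place and line up with each other. The sketch below is not part of the project: load_word_vectors is a hypothetical helper, and it assumes that word_list.txt holds one UTF-8 word per line and that words_vector.npy holds a 2-D array whose i-th row is the vector of the i-th word.

```python
from pathlib import Path

import numpy as np


def load_word_vectors(word_list_path, vectors_path):
    """Return a {word: vector} dict built from the two downloaded files.

    Hypothetical helper, not taken from the project code.
    """
    # Assumption: one word per line, UTF-8 encoded.
    words = Path(word_list_path).read_text(encoding="utf-8").splitlines()
    # Assumption: a 2-D array whose i-th row belongs to the i-th word.
    vectors = np.load(vectors_path)
    if vectors.ndim != 2 or vectors.shape[0] != len(words):
        raise ValueError(
            f"{len(words)} words do not match vector array of shape {vectors.shape}"
        )
    return dict(zip(words, vectors))


if __name__ == "__main__":
    word_file = Path("word_list.txt")
    vector_file = Path("words_vector.npy")
    if word_file.exists() and vector_file.exists():
        embeddings = load_word_vectors(word_file, vector_file)
        print(f"Loaded {len(embeddings)} word vectors")
    else:
        print("word_list.txt and words_vector.npy must be in the project folder")
```

If the check raises a ValueError, the two files most likely come from different downloads or one of them is incomplete.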