Start here: Analysis_OG.ipynb
This project applies concepts and techniques from natural language processing and opinion mining. The goal is to build a system that classifies Hindi and Marathi text code-mixed with English according to its overall polarity (i.e. positive, negative, or neutral).
Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. It is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media content, and to healthcare materials, for applications that range from marketing to customer service to clinical medicine.
The first step was to gather as many comments as possible for the model. We collected about 5,000 comments from major social media platforms such as Facebook and YouTube, mostly related to social and political views, which gave us data spanning all three polarities.
The next step was to tag all the data according to its polarity (i.e. positive, negative, neutral). The tagging scheme, illustrated in the sketch after this list, was:
- Positive Comment : 3
- Negative Comment : 1
- Neutral Comment : 2
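As a small illustration of this scheme (the DataFrame layout and column names below are hypothetical; the actual annotation was done by hand on the collected comments):

```python
import pandas as pd

# Hypothetical layout: one comment per row with a manually assigned sentiment string.
df = pd.DataFrame({
    "comment":   ["movie bahut achha tha", "worst experience ever", "thik thak aahe"],
    "sentiment": ["positive", "negative", "neutral"],
})

# Tagging scheme from above: positive -> 3, neutral -> 2, negative -> 1.
label_map = {"positive": 3, "neutral": 2, "negative": 1}
df["polarity"] = df["sentiment"].map(label_map)
print(df)
```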
Once the data is tagged, we pre-process it before feeding it to the model. The goal of preprocessing text data is to take it from its raw, readable form to a format the computer can work with more easily. Most text data, including the data used in this project, arrives as strings of text; preprocessing is all the work that takes that raw input and prepares it for insertion into a model.
While preprocessing for numerical data depends largely on the data itself, preprocessing text data is a fairly straightforward process, although understanding each step and its purpose is less trivial. Our preprocessing method consists of two stages: preparation and vectorization. The preparation stage cleans the data and cuts the fat, in seven steps: 1. removing URLs, 2. lowercasing all text, 3. removing numbers, 4. removing punctuation, 5. tokenization, 6. removing stopwords, and 7. lemmatization. Stopwords are common words that typically add no meaning.
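A minimal sketch of the preparation stage, assuming an NLTK-based pipeline (the exact implementation in Analysis_OG.ipynb may differ); note that NLTK's English stopword list and WordNet lemmatizer only partially cover the Hindi/Marathi portions of code-mixed comments:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
# (which of the two tokenizer packages is needed depends on the NLTK version).
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def prepare(text: str) -> list[str]:
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # 1. remove URLs
    text = text.lower()                                               # 2. lowercase
    text = re.sub(r"\d+", "", text)                                   # 3. remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # 4. remove punctuation
    tokens = word_tokenize(text)                                      # 5. tokenization
    tokens = [t for t in tokens if t not in stop_words]               # 6. remove stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]                  # 7. lemmatization

print(prepare("Check https://example.com - yeh movie 100% amazing thi!!!"))
```

The vectorization stage then turns these token lists into numeric features (for example with a count or TF-IDF vectorizer) before they are fed to a model.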
train_test_split returns four arrays: training data, test data, training labels, and test labels. By default, train_test_split splits the data into 75% training data and 25% test data, which is a reasonable rule of thumb.
The test_size keyword argument specifies what proportion of the original data is used for the test set. Here we set test_size=0.3, which gives 70% training data and 30% test data.
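A self-contained sketch of the split, using a toy code-mixed corpus and TF-IDF vectorization as stand-ins (whether the notebook uses TF-IDF or count vectors is not specified here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and polarity tags (1 = negative, 2 = neutral, 3 = positive); purely illustrative.
comments = ["movie bahut achha tha", "worst experience ever", "thik thak hai",
            "khup chan video", "not good at all", "theek hai, nothing special"]
labels = [3, 1, 2, 3, 1, 2]

X = TfidfVectorizer().fit_transform(comments)      # vectorization stage
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42      # 70% train / 30% test
)
print(X_train.shape, X_test.shape)
```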
Hyperparameter tuning was applied to the algorithms used in Analysis_OG.ipynb, such as Linear Regression and XGBoost.
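As a sketch of what that tuning could look like with scikit-learn's GridSearchCV and XGBoost (the synthetic features and the parameter grid below are stand-ins, not the values used in Analysis_OG.ipynb):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic 3-class data standing in for the vectorized comments.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Illustrative search space; the grids actually searched in the notebook may differ.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}

grid = GridSearchCV(XGBClassifier(eval_metric="mlogloss"),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```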
Accuracy, precision, recall, and F-score for every algorithm used are given in Values. The overall accuracy obtained was around 70%.
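These metrics can be computed for any of the models with scikit-learn; the label vectors below are purely illustrative and are not the reported results:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative ground-truth and predicted polarity tags (1/2/3), not real output.
y_true = [3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
y_pred = [3, 1, 2, 1, 1, 2, 3, 2, 2, 3]

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy  = {accuracy_score(y_true, y_pred):.3f}")
print(f"precision = {precision:.3f}  recall = {recall:.3f}  f1 = {fscore:.3f}")
```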
Open questions and topics for further exploration:
- Contextual understanding and tone
- Sentiment analysis at Brandwatch?
- The caveats of sentiment analysis
- Predictions for the future of sentiment analysis
- Is the accuracy proportional to, or otherwise dependent on, the amount of data collected?
- Should the data source closely match the intended uses? -- https://blog.infegy.com/understanding-sentiment-analysis-and-sentiment-accuracy
- Is sentence-level cross-lingual sentiment classification enough to predict the sentiment?