<h1 align="center">PART I</h1>
<h1 align="center">Data Processing and Preparation for Training</h1>

The aim is to perform semantic analysis of English texts, so the collection of positive and negative reviews was taken from [here](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/). This dataset is based on reviews of various Amazon products.

To get started, download and extract the archive ([unprocessed.tar.gz](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz)) and move the data to the project folder. Now we import the libraries we need:

In [1]:
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import xml.etree.ElementTree as ET

%matplotlib inline

Let's also add a style that centers all graphs, images, etc.:

In [2]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Since we only need the files with positive and negative reviews, let's delete the others. In each remaining file we wrap all the reviews in a single parent tag, which makes it easier to build an XML tree. Our XML library also complains about the character `&` when parsing, so we replace it with the word `and`:

In [3]:
import os

for parent, dirnames, filenames in os.walk('sorted_data'):
    for fn in filenames:
        if not (fn.startswith('positive') or fn.startswith('negative')):
            os.remove(os.path.join(parent, fn))
        else:
            filename = os.path.join(parent, fn)
            with open(filename, 'r') as original:
                # the XML parser rejects a bare '&', so spell it out
                data = original.read().replace("&", "and").replace("", "")
            with open(filename, 'w') as modified:
                # wrap the reviews in a single <reviews> root element
                modified.writelines(["<reviews>\n"] + data.splitlines(True)[1:-1] + ["</reviews>"])

print("Remove unnecessary files!\nAdd additional tag to reviews!")

Remove unnecessary files!
Add additional tag to reviews!
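
The parser's complaint about `&` can be reproduced on a tiny string. This is a minimal sketch: the sample review below is made up for illustration and is not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line review containing a bare ampersand
snippet = "<review><title>Rock & Roll</title></review>"

try:
    ET.fromstring(snippet)
    parsed_ok = True
except ET.ParseError:
    # a bare '&' is not well-formed XML, so parsing fails
    parsed_ok = False

# Same workaround as in the cell above: spell the character out
cleaned = snippet.replace("&", "and")
title = ET.fromstring(cleaned).find("title").text

print(parsed_ok)
print(title)
```

The replacement changes the review text slightly, which is acceptable here because only the wording of the reviews matters for the later steps.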


Next, since our reviews are in XML format, we need to process each line to get the value from the tags.

**Note**: 
* **`errors='ignore'`** is used to skip any incorrectly encoded characters in our documents;

* **`readlines()`** returns all lines in the file as a list, where each line is one item of the list.

In [4]:
test_read = open('sorted_data/music/positive.review', 'r', encoding='utf8', errors='ignore').readlines()
for i in range(10):
    print(test_read[i])

<reviews>

<review><unique_id>

B00008QS9V:it's_a_mighty_wind_a_blowin'!:carla_j._dinsmore_"acoustic_music_junkie"

</unique_id>

<unique_id>

3276

</unique_id>

<asin>

B00008QS9V

</asin>
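
To see what `errors='ignore'` and `readlines()` do in isolation, here is a small sketch on a temporary file. The file name and its contents are assumptions made up for this example.

```python
import os
import tempfile

# Hypothetical file content; the byte \xff is not valid UTF-8
raw = b"good \xff album\nsecond line\n"

try:
    raw.decode('utf8')          # strict decoding (the default) fails
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.review')
    with open(path, 'wb') as f:
        f.write(raw)
    # errors='ignore' silently drops the undecodable byte
    with open(path, 'r', encoding='utf8', errors='ignore') as f:
        lines = f.readlines()   # one list item per line, newline kept

print(strict_ok)
print(lines)
```

Each item returned by `readlines()` keeps its trailing newline, and `print` adds another one, which is why the output above is double-spaced.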



The directory with our training data contains two files, one with positive and one with negative reviews, and each file contains 1000 unprocessed reviews. Let's check this:

In [5]:
pos_reviews = open('sorted_data/music/positive.review', 'r', encoding='utf8', errors='ignore').read()
neg_reviews = open('sorted_data/music/negative.review', 'r', encoding='utf8', errors='ignore').read()

pos_rev_tree = ET.fromstring(pos_reviews)
neg_rev_tree = ET.fromstring(neg_reviews)

pos_tags = pos_rev_tree.findall('review')
neg_tags = neg_rev_tree.findall('review')

print('\nNumber of Positive Reviews (Music):', len(pos_tags),
      '\nNumber of Negative Reviews (Music):', len(neg_tags))


Number of Positive Reviews (Music): 1000 
Number of Negative Reviews (Music): 1000


**Note:** 
* **`fromstring()`** parses XML from a string directly into an Element, which is the root element of the parsed tree;
* **`findall()`** finds all elements with the given tag that are direct children of the current element (here, `reviews`).
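
The "direct children" behaviour of `findall()` is easy to check on a toy tree. In this sketch the XML string and the `archive` tag are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two top-level reviews and one nested deeper
xml_text = """<reviews>
<review><rating>5.0</rating></review>
<review><rating>2.0</rating></review>
<archive><review><rating>1.0</rating></review></archive>
</reviews>"""

root = ET.fromstring(xml_text)           # root is the <reviews> element
direct = root.findall('review')          # direct children only
any_depth = root.findall('.//review')    # XPath-style search at any depth

ratings = [r.find('rating').text for r in direct]
print(root.tag, len(direct), len(any_depth), ratings)
```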

<h2 align="center">XML Parse</h2>

Now we will parse our XML files at a basic level. We will create a dictionary of lists, where each list holds the tags of a single review together with their content. We already have trees with the positive and negative reviews, so all that remains is to walk through them. Before that, let's list all the tags in our XML files so we don't have to repeat them:

In [6]:
REVIEW_TAGS = ['unique_id', 'asin', 'product_name', 'helpful', 'rating', 'title',
               'date', 'reviewer', 'reviewer_location', 'review_text']

In [7]:
def parseXML(xml_reviews):
    reviews = {}

    # number the reviews from 1: 'review1', 'review2', ...
    for count, item in enumerate(xml_reviews, start=1):
        rev_name = 'review' + str(count)
        # one 'tag | value' string per tag, in the order of REVIEW_TAGS
        reviews[rev_name] = [
            tag + ' | ' + item.find(tag).text.strip() for tag in REVIEW_TAGS
        ]

    return reviews

Now let's create the dictionaries with the required values and check the output:

In [8]:
pos_reviews_dict = parseXML(pos_tags)
neg_reviews_dict = parseXML(neg_tags)
pos_reviews_dict['review1']

['unique_id | B00008QS9V:it\'s_a_mighty_wind_a_blowin\'!:carla_j._dinsmore_"acoustic_music_junkie"',
 'asin | B00008QS9V',
 'product_name | A Mighty Wind: The Album: Music: Various Artists',
 'helpful | ',
 'rating | 5.0',
 "title | It's a Mighty Wind a blowin'!",
 'date | July 5, 2006',
 'reviewer | Carla J. Dinsmore "Acoustic music junkie"',
 'reviewer_location | Wilmington, DE USA',
 'review_text | This is a wonderful album, that evokes memories of the 60\'s folk boom, yet contains original songs. I was amazed at the fantastic harmonies and musical arrangements.\nAnyone who loves the movie "A Mighty Wind" and who loves folk music will fall in love with this album. I know I did']
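
`parseXML` relies on every review containing every tag with some text in it. `find()` returns `None` for a missing tag, and `.text` is `None` for an empty element, so `.text.strip()` would raise an `AttributeError` in either case. A more defensive variant could use `findtext()` instead. This is only a sketch: the helper name `safe_fields`, the shortened tag list, and the sample review are assumptions, not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical tag list for the example
TAGS = ['unique_id', 'asin', 'rating']

def safe_fields(item, tags=TAGS):
    # findtext() returns the given default when the tag is missing
    # and '' when the tag is present but has no text
    return [tag + ' | ' + item.findtext(tag, '').strip() for tag in tags]

# <asin> is empty and <rating> is missing entirely
review = ET.fromstring("<review><unique_id> 42 </unique_id><asin></asin></review>")
fields = safe_fields(review)
print(fields)
```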

Next, we will turn our review dictionaries into a pandas DataFrame:

In [9]:
def dict_to_dataframe(reviews_dict):
    # every entry looks like 'tag | value'; split at the first ' | ' only,
    # so a '|' inside the value is kept and no leading space remains
    rows = [
        [field.split(' | ', 1)[1] for field in val]
        for val in reviews_dict.values()
    ]

    # build the dataframe in one step instead of appending row by row
    return pd.DataFrame(rows, columns=REVIEW_TAGS)

pos_music = dict_to_dataframe(pos_reviews_dict)
neg_music = dict_to_dataframe(neg_reviews_dict)
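
Each stored entry has the form `tag | value`, and the value itself may contain a `|` character. The sketch below shows how splitting only at the first separator keeps such a value intact. The helper `split_entry` and the sample strings are made up for illustration.

```python
def split_entry(entry):
    # maxsplit=1: cut at the first ' | ' and leave the rest untouched
    tag, value = entry.split(' | ', 1)
    return tag, value

print(split_entry('title | Rock | Roll'))   # value contains a separator
print(split_entry('helpful | '))            # empty value
```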

Let's check for correctness:

In [10]:
pos_music.head(n=3)

| | unique_id | asin | product_name | helpful | rating | title | date | reviewer | reviewer_location | review_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B00008QS9V:it's_a_mighty_wind_a_blowin'!:carl... | B00008QS9V | A Mighty Wind: The Album: Music: Various Artists | | 5.0 | It's a Mighty Wind a blowin'! | July 5, 2006 | Carla J. Dinsmore "Acoustic music junkie" | Wilmington, DE USA | This is a wonderful album, that evokes memori... |
| 1 | B00005JJ04:sometime_tuesday_morning_defies_de... | B00005JJ04 | Sometime Tuesday Morning: Music: Johnny A. | 4 of 4 | 5.0 | Sometime Tuesday Morning defies description | May 3, 2005 | Tim Withee | Auburn, WA United States | On one hand, this CD is a straight ahead inst... |
| 2 | B0002W4SGS:atreyu_jr:hellrun_"dustin" | B0002W4SGS | The Caitiff Choir: Music: It Dies Today | 0 of 1 | 5.0 | atreyu JR | June 12, 2006 | hellrun "dustin" | wisconsin | this band reminds me of the thrill i first go... |


In [11]:
neg_music.head(n=3)

| | unique_id | asin | product_name | helpful | rating | title | date | reviewer | reviewer_location | review_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B00004YWGC:what_can_i_say?:ms._aj_"right" | B00004YWGC | Back For The First Time: Music: Ludacris | 1 of 13 | 2.0 | What can I say? | May 4, 2006 | Ms. AJ "Right" | North Carolina, USA | I've always held the philosophy you are what ... |
| 1 | B000621498:not_quite_ready_for_prime_time.:g | B000621498 | Things Aren't So Beautiful Now: Music: A Thor... | 1 of 2 | 2.0 | not quite ready for prime time. | May 8, 2006 | g | san francisco | someone get this band a producer and put them... |
| 2 | B0000AQS1A:disapointment:super_dave_"super_dave" | B0000AQS1A | Chicken N Beer: Music: Ludacris | 1 of 2 | 2.0 | Disapointment | July 1, 2006 | Super Dave "Super Dave" | Knoxville, TN | Tihs Album is not all that good when it came ... |


Everything is correct, wonderful!