<a href="https://colab.research.google.com/github/mikazz/TripAdvisor/blob/master/data_flow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting interesting information from the source files. Parsing *.dat files

## Example *.dat file

```
<Overall Rating>4.5
<Avg. Price>$124
<URL>Chelsea Lodge

<Author>NYGuest
<Content>Torturous Stay -> learned the hard way We were upgraded without our knowledge and then charged for it. We stayed for three days. Our room was a dungeon below street level. The walls are paper thin, and there was quite a bit of noise from other guests. The room was not clean. There were spots of what appeared to be blood above the sink in the bathroom, and clearly they have never cleaned the walls or doorknobs in the bedroom. The sheets and towels are threadbare - literally - with holes in them. There were brown stain streaks on both sets of sheets we infested. There was no temperature control in the room. We were at the mercy of the guests above us. The first two nights we froze and the last night we roasted. For some reason they decided to not clean our room on the second day. I can only surmise as a cost saving measure.We are staying in New York for 9 nights and if they didn't have a 72 hour no-cancellation policy we would have left after the first night. We are now staying in a much more comfortable and squeaky clean hotel that is $80 less a night (booked through hotline dot com)The hotel has no service whatsoever. It's basically a motel with no parking. To me that means 1 star. We are feeling ripped off at this point. 190/night for a room in a converted basement is absurd in any city. For the life of me I cannot explain our experience. I read through previous reviews carefully and we expected some charm here. Alas this was not the case. I am left with the suspicion that the hotel has spammed this board with a large number of their own reviews. I would unequivocally not recommend this establishment. There are many comfortable and clean hotels in New York that have a number of guest services and can be had for less money if you merely take the time to shop on the internet at any of a number of web sites. 
<Date>Jan 4, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>3
<No. Helpful>1
<Overall>1
<Value>1
<Rooms>1
<Location>3
<Cleanliness>1
<Check in / front desk>2
<Service>-1
<Business service>3
```
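
The sample above follows a simple `<Tag>value` line format. As a minimal sketch of how such lines can be split into tag/value pairs, the snippet below uses a shortened excerpt of that record; `parse_fields` and `TAG_LINE` are hypothetical names introduced only for this illustration and are not part of the notebook's code.

```python
import re

# Shortened excerpt of the sample record above (review text truncated for brevity)
sample = """<Overall Rating>4.5
<Avg. Price>$124
<URL>Chelsea Lodge

<Author>NYGuest
<Content>Torturous Stay -> learned the hard way
<Date>Jan 4, 2009
<No. Reader>3
<Overall>1
<Service>-1
"""

# Each data line has the shape "<Tag>value"; tags may contain spaces, dots and slashes
TAG_LINE = re.compile(r'^<([^<>]+)>(.*)$', re.MULTILINE)


def parse_fields(text):
    """Return (tag, value) pairs in file order (hypothetical helper)."""
    return [(tag, value.strip()) for tag, value in TAG_LINE.findall(text)]


fields = parse_fields(sample)
for tag, value in fields:
    print(f"{tag!r}: {value!r}")
```

Note that a stray HTML line such as the `<img .../>` entry in the full sample would also match this pattern (with an empty value), so a real parser would probably need to filter on a list of known tags.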



## Parsing to CSV format

In [0]:
import glob
import os
import re

# Portable glob pattern (the original backslash-separated path only worked on Windows)
PATH = os.path.join("reviews_folder", "*.dat")

for path in glob.glob(PATH):
    try:
        with open(path) as file:
            content = file.read()
    except OSError as e:
        print(e)
        continue

    # "reviews_folder/hotel_81251.dat" -> "hotel_81251"
    name = os.path.splitext(os.path.basename(path))[0]

    # Hotel-level header fields; the dot in "Avg. Price" is escaped to match literally
    rating_list = re.findall(r'<Overall Rating>(.*)', content)
    price_list = re.findall(r'<Avg\. Price>(.*)', content)
    url_list = re.findall(r'<URL>(.*)', content)
    # One <Author> line per review, so this is the number of comments
    comment_number = len(re.findall(r'<Author>(.*)', content))

    fmt = '{};{};{};{};{}'
    for rating, price, url in zip(rating_list, price_list, url_list):
        print(fmt.format(name, rating, price, url, comment_number))


## Result
## name; overall_avg; price; url; number_of_comments

```
hotel_81251;3.5;$179;http://www.tripadvisor.com/ShowUserReviews-g60713-d81251-r23352903-Opal_Hotel-San_Francisco_California.html;169
hotel_81315;4;$367;http://www.tripadvisor.com/ShowUserReviews-g60713-d81315-r23333280-San_Francisco_Marriott-San_Francisco_California.html;236
```
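
The parsing cell prints semicolon-separated lines to standard output. If an actual CSV file is wanted, one option is Python's standard `csv` module with `delimiter=";"`. The following is only a sketch under that assumption: `rows_to_csv` and `HEADER` are hypothetical names, and the rows are the two result lines shown above.

```python
import csv
import io

# Column names taken from the heading above
HEADER = ["name", "overall_avg", "price", "url", "number_of_comments"]

# The two result rows shown above
rows = [
    ("hotel_81251", "3.5", "$179",
     "http://www.tripadvisor.com/ShowUserReviews-g60713-d81251-r23352903-Opal_Hotel-San_Francisco_California.html",
     169),
    ("hotel_81315", "4", "$367",
     "http://www.tripadvisor.com/ShowUserReviews-g60713-d81315-r23333280-San_Francisco_Marriott-San_Francisco_California.html",
     236),
]


def rows_to_csv(rows, out):
    """Write a header line and all rows to a file-like object (hypothetical helper)."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, the same function can be given a file opened with `open("hotels.csv", "w", newline="")`; the file name here is only an example.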



# Parsing *.txt files

## Example *.txt file with hotels

```
<Author>RW53
<Content>Location! Location?       view from room of nearby freeway  
<Date>Dec 26, 2008
<Rating>3	4	3	2	4	3	-1	-1	
```
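
The `<Rating>` line holds tab-separated integers and ends with a trailing tab. The sketch below shows one way to turn it into a list of numbers. Treating `-1` as "not rated" is an assumption made for this example (the data itself does not say so), and `parse_rating` is a hypothetical helper name.

```python
# The <Rating> value from the sample above: tab-separated integers with a trailing tab
rating_line = "3\t4\t3\t2\t4\t3\t-1\t-1\t"


def parse_rating(line, missing=-1):
    """Split a rating line into ints and map the `missing` marker to None.

    Assumption: -1 means the reviewer gave no score for that aspect.
    """
    # Skip the empty field produced by the trailing tab
    values = [int(field) for field in line.split("\t") if field.strip()]
    return [None if value == missing else value for value in values]


parsed = parse_rating(rating_line)
print(parsed)  # [3, 4, 3, 2, 4, 3, None, None]
```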



## Parsing into the form:
  comment \n  
  comment \n  
  comment \n  

In [0]:

import glob
import os
import re

# Portable glob pattern (the original backslash-separated path only worked on Windows)
PATH = os.path.join("HOTELS", "*.txt")


def write_rating_to_file(rating_lists, file_name):
    # Append the first value of every rating row, one per line
    with open(file_name, "a") as file:
        for ratings in rating_lists:
            file.write(str(ratings[0]) + "\n")


def write_to_file(content_list, file_name):
    # Append every cleaned comment on its own line
    with open(file_name, "a") as file:
        for line in content_list:
            file.write(line + "\n")


def content_parser(content):
    content_list = []
    for i in content:
        # Clean nasty parsing errors ( showReview(1020542, 'full'); and etc )
        i = re.sub(r'(showReview)(.*)(;)', " ", i)
        # Collapse all whitespace runs (space, tab, newline, return, formfeed) into single spaces
        i = " ".join(i.split())
        # Append cleaned line
        content_list.append(i)
    return content_list


def rating_parser(rating):
    rating_list = []
    for i in rating:
        # Drop empty fields (left by the trailing tab) and convert the rest to ints
        rating_list.append([int(x) for x in i.split("\t") if x != ''])
    return rating_list


def main():
    for path in glob.glob(PATH):
        file_name = os.path.basename(path)
        with open(path) as file:
            lines = file.read()
        print("Parsed: " + file_name)

        # Find all text with <Content> at the beginning
        content = re.findall(r'<Content>(.*)', lines)
        content_list = content_parser(content)

        # Rating lists
        rating = re.findall(r'<Rating>(.*)', lines)
        rate = rating_parser(rating)
        write_rating_to_file(rate, "R" + file_name)

        # Date lists (extracted, but not written anywhere yet)
        date_lists = re.findall(r'<Date>(.*)', lines)

        # A fresh list per file, so no global state has to be reset
        write_to_file(content_list, "P" + file_name)

    print("Done Parsing")


if __name__ == "__main__":
    main()
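
To make the cleaning step in `content_parser` more concrete, here is a small sketch that applies the same two operations to a single made-up comment. The input strings are invented for illustration, and `clean_comment` and `clean_comment_lazy` are hypothetical names. The second function is an alternative that is not in the notebook: it uses a non-greedy pattern, because the greedy `.*` removes everything up to the last `;` on the line.

```python
import re


def clean_comment(text):
    """Apply the same two cleaning steps as content_parser to one comment."""
    # Greedy: removes from "showReview" up to the LAST ';' in the text
    text = re.sub(r'(showReview)(.*)(;)', " ", text)
    # Collapse whitespace runs into single spaces and trim the ends
    return " ".join(text.split())


def clean_comment_lazy(text):
    """Alternative sketch (assumption): stop at the first ');' after showReview(."""
    text = re.sub(r'showReview\(.*?\);', " ", text)
    return " ".join(text.split())


raw = "Great stay   showReview(1020542, 'full');   would  return"
print(clean_comment(raw))        # Great stay would return

tricky = "A showReview(1, 'full'); B; C"
print(clean_comment(tricky))       # A C
print(clean_comment_lazy(tricky))  # A B; C
```

Whether the greedy or the non-greedy behaviour is preferable depends on how the `showReview` residue actually appears in the data, which the sample file does not show.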
