# URL Title Counter

**Main task:** read a log file, inspect its content, extract and count the main topics (titles) of the log items, and show how many messages each topic has.


### 1. Checking the log file
The first step is to open the file and look through its content. As we can see, the file contains 100 elements: some are empty lines, and some are just topics (lexemes) without messages. The first 20 elements of the file illustrate this.

In [6]:
path = r'C:\Books\result\data\urls.txt'
# Use a context manager so the file is closed after reading
with open(path, 'r', encoding='utf-8') as prev_check:
    list_prev_check = prev_check.readlines()
print(len(list_prev_check), list_prev_check[0:20], sep='\n'*2)


100

['/\n', '/starlife/\n', '/world/\n', '/latest/\n', '/incidents/\n', '/politics/\n', '/business/\n', '/kiev_city/\n', '/head/\n', '/?updated=top\n', '/politics/36188461-some-video/\n', '/world/36007585-some-article/\n', '/science/36157853-some-video/\n', '/video/36001498-some-video/\n', '/world/36007585-some-article/?smi2=1\n', '/science/\n', '/sport/\n', '/middleeast/36131117-some-video/\n', '/economics/36065674-some-article/\n', '/politics/36118047-some-video/?smi2=1\n']


As the file does not have much content, we can look at all of the items.

In [7]:
print(*list_prev_check)

/
 /starlife/
 /world/
 /latest/
 /incidents/
 /politics/
 /business/
 /kiev_city/
 /head/
 /?updated=top
 /politics/36188461-some-video/
 /world/36007585-some-article/
 /science/36157853-some-video/
 /video/36001498-some-video/
 /world/36007585-some-article/?smi2=1
 /science/
 /sport/
 /middleeast/36131117-some-video/
 /economics/36065674-some-article/
 /politics/36118047-some-video/?smi2=1
 /travel/36194479-some-article/
 /politics/35638742-some-video/
 /video/36012692-some-article/
 /starlife/36174817-some-video/
 /health/36149308-some-article/
 /science/36139723-some-video/
 /cis/36229699-some-article/
 /incidents/36225203-some-video/
 /politics/36118047-some-article/
 /world/36075956-some-video/
 /politics/36115220-some-article/
 /world/36018565-some-video/
 /politics/36150505-some-article/
 /middleeast/36131117-some-video/?smi2=1
 /sport/36055585-some-article/
 /crazy-world/36193471-some-video/
 /crazy-world/36087352-some-article/
 /incidents/36096689-some-video/
 /video/36225009

On the whole, we can draw the following conclusions about the titles:
    1) some of them appear without any messages;
    2) some of them first appear without a message but later have messages of their own;
    3) and some of them always come with a message.

The URL title counter should handle all of these variants.
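The three variants above can be checked with a short sketch. It is only an illustration: the function name `group_topics` and the small `sample` list are assumptions (the sample lines are copied from the listing above), and the "mixed" group here covers topics that appear both with and without a message, in either order.

```python
def group_topics(lines):
    """Group topics by whether their lines carry a message (illustrative helper)."""
    bare = {}  # topic -> number of lines without a message
    full = {}  # topic -> number of lines with a message
    for line in lines:
        # '/topic/message/' -> ['topic', 'message']
        parts = line.strip().split('/')[1:-1]
        if not parts:
            continue  # root path or query-only line
        topic = parts[0]
        bare.setdefault(topic, 0)
        full.setdefault(topic, 0)
        if len(parts) == 1:
            bare[topic] += 1
        else:
            full[topic] += 1
    never = sorted(t for t in bare if full[t] == 0)
    mixed = sorted(t for t in bare if bare[t] and full[t])
    always = sorted(t for t in bare if bare[t] == 0)
    return never, mixed, always


# Hypothetical sample built from lines shown in the listing above
sample = [
    '/\n',
    '/latest/\n',
    '/politics/\n',
    '/politics/36188461-some-video/\n',
    '/video/36001498-some-video/\n',
    '/?updated=top\n',
]
never, mixed, always = group_topics(sample)
print(never, mixed, always)  # ['latest'] ['politics'] ['video']
```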

### 2. Creating the url title counter

The main task requires the following steps: 1) clean each log line of extra whitespace, extra elements, and blank lines; 2) separate the main title; 3) (optional) extract the actual message from the line. Each of these elements is printed separately. The counter itself is built as a dictionary, 'result', with topics as keys and message counts as values.

In [8]:
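result = {}

with open(path, encoding='utf-8') as data:

    # Skip the first line, which is the bare root path '/'
    for line in data.readlines()[1:]:
        # '/topic/message/' -> ['topic', 'message']; the segment after the
        # last slash (empty, or a query such as '?smi2=1') is dropped
        clear_line = line.strip().split('/')[1:-1]

        if len(clear_line) == 1:
            # Topic without a message: register it with a zero count
            print(f'Topic: {clear_line[0]}', 'No message', sep='\n')
            result.setdefault(clear_line[0], 0)

        elif len(clear_line) > 1:
            # Topic with a message: increment its count
            print(f'Topic: {clear_line[0]}', clear_line[1], sep='\n')
            result[clear_line[0]] = result.get(clear_line[0], 0) + 1

        else:
            # Nothing between the slashes (for example '/?updated=top'): skip
            continue

        print('-'*10)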

Topic: starlife
No message
----------
Topic: world
No message
----------
Topic: latest
No message
----------
Topic: incidents
No message
----------
Topic: politics
No message
----------
Topic: business
No message
----------
Topic: kiev_city
No message
----------
Topic: head
No message
----------
Topic: politics
36188461-some-video
----------
Topic: world
36007585-some-article
----------
Topic: science
36157853-some-video
----------
Topic: video
36001498-some-video
----------
Topic: world
36007585-some-article
----------
Topic: science
No message
----------
Topic: sport
No message
----------
Topic: middleeast
36131117-some-video
----------
Topic: economics
36065674-some-article
----------
Topic: politics
36118047-some-video
----------
Topic: travel
36194479-some-article
----------
Topic: politics
35638742-some-video
----------
Topic: video
36012692-some-article
----------
Topic: starlife
36174817-some-video
----------
Topic: health
36149308-some-article
----------
Topic: science
3613972
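The `[1:-1]` slice in the cell above assumes that every path ends with a slash or a query part; a path without a trailing slash (a hypothetical `/video/123`) would lose its last segment and be treated as a topic without a message. The following is a minimal alternative sketch, not the method used in this notebook: it relies on the standard-library `urllib.parse.urlsplit` and `collections.Counter`, and the names `count_topics` and `sample` are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_topics(lines):
    """Count messages per topic; topics seen without a message stay at 0."""
    counts = Counter()
    for line in lines:
        # urlsplit separates the path from any query string such as '?smi2=1'
        path = urlsplit(line.strip()).path
        # Drop the empty strings produced by leading/trailing slashes
        segments = [s for s in path.split('/') if s]
        if not segments:
            continue  # root path, blank line, or query-only line
        topic = segments[0]
        counts.setdefault(topic, 0)  # register the topic even without a message
        if len(segments) > 1:
            counts[topic] += 1
    return counts


# Hypothetical sample; '/video/123' is an invented line without a trailing slash
sample = [
    '/\n',
    '/starlife/\n',
    '/?updated=top\n',
    '/world/36007585-some-article/\n',
    '/world/36007585-some-article/?smi2=1\n',
    '/video/123\n',
]
counts = count_topics(sample)
print(dict(counts))  # {'starlife': 0, 'world': 2, 'video': 1}
```

As in the original cell, the same article reached with and without `?smi2=1` is counted twice; deduplicating such entries would be a separate design decision.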

The results are displayed sorted by the first letter of the topic and, within each letter, by frequency in descending order.

In [9]:
total = 0
for key, value in sorted(result.items(), key=lambda both: (both[0][0], -both[1])):
    print(f'Topic "{key}"; items overall: {value}')
    total += value

Topic "articles"; items overall: 7
Topic "auto"; items overall: 0
Topic "business"; items overall: 4
Topic "cis"; items overall: 4
Topic "crazy-world"; items overall: 2
Topic "economics"; items overall: 3
Topic "europe"; items overall: 1
Topic "finances"; items overall: 1
Topic "head"; items overall: 2
Topic "health"; items overall: 2
Topic "incidents"; items overall: 5
Topic "kinomusic"; items overall: 1
Topic "kiev_city"; items overall: 0
Topic "lifestyle"; items overall: 1
Topic "latest"; items overall: 0
Topic "middleeast"; items overall: 3
Topic "politics"; items overall: 10
Topic "starlife"; items overall: 12
Topic "science"; items overall: 5
Topic "sport"; items overall: 2
Topic "scitech"; items overall: 0
Topic "travel"; items overall: 1
Topic "video"; items overall: 10
Topic "world"; items overall: 8
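The sort key in the cell above compares only the first letter of each topic (`both[0][0]`) and then the negated count, which is why "kinomusic" is listed before "kiev_city". The sketch below compares that ordering with two possible alternatives on a small dictionary; the four values are taken from the output above, while the variable names are assumptions made for illustration.

```python
# A few topic counts copied from the output above
counts = {'kiev_city': 0, 'kinomusic': 1, 'science': 5, 'starlife': 12}

# Ordering used in the notebook: first letter, then descending count
by_first_letter = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1]))

# Alternative 1: plain alphabetical order by the full topic name
by_name = sorted(counts.items())

# Alternative 2: descending count, ties broken alphabetically
by_count = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print([k for k, _ in by_first_letter])  # ['kinomusic', 'kiev_city', 'starlife', 'science']
print([k for k, _ in by_name])          # ['kiev_city', 'kinomusic', 'science', 'starlife']
print([k for k, _ in by_count])         # ['starlife', 'science', 'kinomusic', 'kiev_city']
```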


In [10]:
print('Total result:', total)


Total result: 84


### 3. Conclusion

Creating the URL title counter took several steps: 1) checking the content of the log file; 2) cleaning its lines; 3) building the final result. Overall, as we can see, the log file has 84 entries that carry a message, while some topics (for example, "latest" and "kiev_city") appear without any message at all.