## **Script to clean the original files of the dataset**

This process does not analyze the content of the data, except in a few rare cases.

The goal here is to prepare the information so it can be loaded into the other Jupyter notebook, where the analyses will be done.

In the file 2000_orig.txt, the song's rank comes after the band/song name.

In the files from 2001 to 2005, the song's rank comes before the band/song name and is followed by a "." (period).

In the 2006 file, the song's rank comes before the band/song name and is not followed by a "." (period).
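The three layouts described above can be sketched with one regular expression each. This is only an illustration: the function `parse_ranked_line`, the pattern names, and the sample lines (including the rank 12) are hypothetical and are not taken from the original files.

```python
import re

# Hypothetical patterns, one per layout described in the text
RANK_BEFORE_DOT = re.compile(r"^(?P<rank>\d{1,3})\.\s*(?P<rest>.+)$")  # 2001 to 2005
RANK_BEFORE = re.compile(r"^(?P<rank>\d{1,3})\s+(?P<rest>.+)$")        # 2006
RANK_AFTER = re.compile(r"^(?P<rest>.+?)\s+(?P<rank>\d{1,3})$")        # 2000


def parse_ranked_line(line: str):
    """Return (rank, 'artist-song') for any of the three layouts, or None."""
    line = line.strip()
    # Try the most specific layout first
    for pattern in (RANK_BEFORE_DOT, RANK_BEFORE, RANK_AFTER):
        match = pattern.match(line)
        if match:
            return int(match.group("rank")), match.group("rest").strip()
    return None


# Made-up sample lines, one per layout
print(parse_ranked_line("12. Yes-Love Will Find A Way"))
print(parse_ranked_line("12 Yes-Love Will Find A Way"))
print(parse_ranked_line("Yes-Love Will Find A Way 12"))
```

One caveat of this sketch: a line with no rank whose song title ends in a number (for example a title ending in "68") would be read as if that number were the rank, so such lines would still need a manual check.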

In [1]:
import json
import os
import re
from collections import Counter

In [2]:
folder_original = "resources/bases/original/"
folder_threated = "resources/bases/tratada/"

In [3]:
first_year = 2000
last_year = 2023

In [4]:
def int_try_parse(value):
    """
    Try to convert value to int and return (result, success flag)
    """
    try:
        return int(value), True
    except (TypeError, ValueError):
        # Not a valid integer: hand the original value back
        return value, False

In [5]:
def read_file_on_line(path_to_file: str):
    """
    Read a file and return its whole content as a single string
    """
    with open(path_to_file) as f:
        return f.read()

In [6]:
def read_file_lines(path_to_file: str):
    """
    Read a file and return its non-empty lines, stripped of surrounding
    whitespace. Return an empty list if the file does not exist.
    """
    if not os.path.exists(path_to_file):
        return []

    with open(path_to_file) as f:
        return [stripped for line in f if (stripped := line.strip())]

In [7]:
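def threat_artist_with_hyphen_in_name(text: str) -> str:
    """
    Upper-case the text and replace the hyphen in known artist names,
    so it is not mistaken for the artist/song separator
    """
    text = text.upper()
    text = text.replace("A-HA", "A HA")
    text = text.replace("B-52", "B 52")
    return text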
    

In [18]:
def standard_artist_name(artist: str) -> str:
    """
    Upper-case the artist name and map known variants to a single spelling
    """
    artist = artist.upper()
    artist = artist.replace("THE BEATLES", "BEATLES")
    return artist
    

In [9]:
def standard_song_name(artist: str, song: str) -> str:
    """
    Map known variants of a song title to a single spelling
    """
    if "REX" in artist.upper() and "GET IT ON" in song.upper():
        return "BANG A GONG (GET IT ON)"
    return song

In [10]:
def save_new_file(items: list, year: int):
    """
    Write one "position|artist|music" line per item to <year>_ok.txt
    """
    path = f"{folder_threated}{year}_ok.txt"

    # Mode 'w' truncates an existing file, so no explicit removal is needed
    with open(path, 'w') as f:
        for item in items:
            f.write(f'{item["position"]}|{item["artist"]}|{item["music"]}\n')
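Since the cleaned files are meant to be loaded elsewhere, here is a minimal sketch of how one `position|artist|music` line could be read back. The helper `parse_treated_line` is hypothetical and is not part of this notebook; it assumes the position was written as plain digits.

```python
def parse_treated_line(line: str) -> dict:
    """Turn one 'position|artist|music' line back into a dictionary."""
    # maxsplit=2 keeps any extra "|" inside the song title in one piece
    position, artist, music = line.rstrip("\n").split("|", 2)
    return {"position": int(position), "artist": artist, "music": music}


# Sample line in the same layout written by save_new_file (the rank is made up)
print(parse_treated_line("12|YES|LOVE WILL FIND A WAY\n"))
```

A line with fewer than three fields raises `ValueError` at the unpacking step, which makes malformed lines easy to spot.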

In [11]:
def validate_ammount(list_songs):
    return len(list_songs) == 500

### **The script below is kept as an example; there is no need to run it**

It is available here in case you need it for study.

In [12]:
r"""
files = [
    "2003.txt",
    "2004.txt",
]

for file in files:
    path_to_file = f"{folder_original}{file}"
    contents = read_file_on_line(path_to_file)

    # Rank, a period, one whitespace character, then "artist-song"
    # with no digits in either part
    pattern = r"\d+\.\s\D+-\D+"
    song_list = []

    itens_in_list = re.findall(pattern, contents)

    for item in itens_in_list:
        point_position = item.find(".")
        order, _ = int_try_parse(item[:4].replace(".", ""))

        song_list.append(
                {
                    "position": order,
                    "artist": item[point_position + 1:].split("-")[0].strip().upper(),
                    "music": item[point_position + 1:].split("-")[1].strip().upper()
                }
            )

    save_new_file(song_list, file[:4])
    print(f"List {file}: valid = {validate_ammount(song_list)}, {len(song_list)} records")
"""

### **The script below is kept as an example; there is no need to run it**

It is available here in case you need it for study.

In [13]:
r"""

year = "2005"
path_to_file = f"{folder_original}{year}.txt"
contents = read_file_on_line(path_to_file)
# Rank, a period, one whitespace character, then text with no digits
pattern = r"\d+\.\s\D+"
song_list = []

itens_in_list = re.findall(pattern, contents)

for item in itens_in_list:
    point_position = item.find(".")
    order, _ = int_try_parse(item[:4].replace(".", ""))

    song_list.append(
            {
                "position": order,
                "artist": item[point_position + 1:].strip().upper(),
                "music": ""
            }
        )

save_new_file(song_list, year)
print(f"List {year}: valid = {validate_ammount(song_list)}, {len(song_list)} records")


"""

## Identify the band name in each list entry and split off the song

## Files from **2000 to 2022**

In [19]:
# range() excludes last_year, so this covers 2000 to 2022
years = list(range(first_year, last_year))

# Rank (1 to 3 digits) followed by the rest of the line
pattern = re.compile(r"(\d{1,3})(.*)")

for year in years:
    path_to_file = f"{folder_original}{year}.txt"
    lines = read_file_lines(path_to_file)

    print(path_to_file, len(lines))
    song_list = []

    for item in lines:
        order, artist_music = pattern.findall(item)[0]
        artist_music = artist_music.strip()

        # Drop the separator that may follow the rank ("-" or ".")
        if artist_music.startswith(("-", ".")):
            artist_music = artist_music[1:].strip()

        # Split on hyphens: the first piece is the artist, the second is the song.
        # Anything after a second hyphen is discarded.
        parts = artist_music.split("-")

        try:
            artist = parts[0].upper().strip()
            music = parts[1].upper().strip()

        except IndexError:
            # No hyphen found: flag the entry for manual review
            artist = "??"
            music = "??"

        artist = standard_artist_name(artist)
        music = standard_song_name(artist, music)

        song_list.append(
                {
                    "position": order,
                    "artist": artist,
                    "music": music
                }
            )

    save_new_file(song_list, year)
    print(f"List {year}: valid = {validate_ammount(song_list)}, {len(song_list)} records")


resources/bases/original/2000.txt 500
Jethro Tull-Too Old To Rock 'n' Roll Too Young To Die
JETHRO TULL
TOO OLD TO ROCK 'N' ROLL TOO YOUNG TO DIE
Chicago-Questions 67 & 68
CHICAGO
QUESTIONS 67 & 68
Santana-Soul Sacrifice (Studio)
SANTANA
SOUL SACRIFICE (STUDIO)
Joe Cocker-Don't Let Me Be Misunderstood
JOE COCKER
DON'T LET ME BE MISUNDERSTOOD
Gary "Us" Bonds-Quarter  To Three
GARY "US" BONDS
QUARTER  TO THREE
Grass Roots-Temptation Eyes
GRASS ROOTS
TEMPTATION EYES
Bachman-Tuner Overdrive-You Ain´t Seen Nothin' Yet
BACHMAN
TUNER OVERDRIVE
Donovan-Sunshine Superman
DONOVAN
SUNSHINE SUPERMAN
Paul McCartney-Silly Love Songs
PAUL MCCARTNEY
SILLY LOVE SONGS
Ramones-Sheena Is A Punk Rocker
RAMONES
SHEENA IS A PUNK ROCKER
Dexy's Midnight Runners-Come On Eileen
DEXY'S MIDNIGHT RUNNERS
COME ON EILEEN
Yes-Love Will Find A Way
YES
LOVE WILL FIND A WAY
Elton John-Crocodile Rock
ELTON JOHN
CROCODILE ROCK
Rod Stewart-Every Picture Tells A Story
ROD STEWART
EVERY PICTURE TELLS A STORY
Jan and Dean-Surf
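
The output above shows that an artist name containing a hyphen ("Bachman-Tuner Overdrive", as spelled in the source file) is cut at the first hyphen, and the real song title is lost. One possible way to handle this, sketched under assumptions: keep a list of known hyphenated artist names and check it before splitting. The names `KNOWN_HYPHENATED_ARTISTS` and `split_artist_song` are hypothetical and not part of this notebook.

```python
# Hypothetical list of upper-cased artist names that contain a hyphen
KNOWN_HYPHENATED_ARTISTS = ("BACHMAN-TUNER OVERDRIVE", "A-HA")


def split_artist_song(artist_music: str):
    """Split 'artist-song' text, keeping known hyphenated artist names whole."""
    text = artist_music.strip().upper()

    # Known hyphenated names take precedence over the generic split
    for name in KNOWN_HYPHENATED_ARTISTS:
        if text.startswith(name + "-"):
            return name, text[len(name) + 1:].strip()

    # Generic case: split at the first hyphen only, keeping the rest as the song
    artist, separator, song = text.partition("-")
    if not separator:
        return "??", "??"  # same placeholder used for lines without a hyphen
    return artist.strip(), song.strip()


print(split_artist_song("Bachman-Tuner Overdrive-You Ain´t Seen Nothin' Yet"))
print(split_artist_song("Yes-Love Will Find A Way"))
```

Unlike the cell above, this sketch keeps everything after the first hyphen as the song title, so a hyphen inside a title is preserved instead of truncating it.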

## Check which positions are missing or duplicated in the cleaned file

In [20]:
positions_real = range(1, 501)

for year in range(first_year, last_year):
    path_to_file = f"{folder_threated}{year}_ok.txt"
    lines = read_file_lines(path_to_file)
    positions = {int(line.split("|")[0]) for line in lines}

    # Always compute the missing ranks: a file can have 500 lines and
    # still miss a rank when another rank is duplicated
    positions_not_found = [pos for pos in positions_real if pos not in positions]

    if positions_not_found:
        print(f"In file {year}_ok.txt the following positions were not found: {positions_not_found}\n")

## Identify positions that are duplicated

The fixes were made manually in the original files, using Eduardo's file as the reference.

In [21]:
for year in range(first_year, last_year):
    path_to_file = f"{folder_threated}{year}_ok.txt"
    lines = read_file_lines(path_to_file)
    positions = [int(line.split("|")[0]) for line in lines]

    # Count every rank once instead of calling list.count for each element
    duplicated = [item for item, count in Counter(positions).items() if count > 1]

    if duplicated:
        print(f"File {year}_ok.txt has duplicated positions. They are: \n")
        print(duplicated, "\n")
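
The two checks above can also be combined into a single helper that works on any list of positions, which makes them easy to try on a small example. This is only a sketch: the function `check_positions` and its `expected_count` parameter are hypothetical.

```python
from collections import Counter


def check_positions(positions, expected_count=500):
    """Return (missing, duplicated) ranks for a list of positions."""
    counts = Counter(positions)
    # Ranks from 1 to expected_count that never appear
    missing = [pos for pos in range(1, expected_count + 1) if pos not in counts]
    # Ranks that appear more than once
    duplicated = sorted(pos for pos, count in counts.items() if count > 1)
    return missing, duplicated


# Small made-up list: rank 3 is missing and rank 2 appears twice
missing, duplicated = check_positions([1, 2, 2, 4, 5], expected_count=5)
print(missing, duplicated)  # [3] [2]
```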
