## HW 1: Haunted Places

#### CSV to TSV Converter
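This Python script converts `haunted_places.csv` to TSV format using pandas. It reads the CSV into a DataFrame and writes it back out tab-delimited, so the rows and columns are unchanged and only the delimiter differs. Any exception raised while reading or writing is caught and printed instead of propagating.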

In [1]:
import pandas as pd

file_name = "haunted_places"

def convert_csv_to_tsv(input_csv, output_tsv):
    try:
        # Read the CSV file
        df = pd.read_csv(input_csv)
        
        # Write to TSV file
        df.to_csv(output_tsv, sep='\t', index=False)
        print(f"Successfully converted {input_csv} to {output_tsv}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    # Replace these with your actual file names
    input_csv_file = f"{file_name}.csv"
    output_tsv_file = f"{file_name}.tsv"
    
    convert_csv_to_tsv(input_csv_file, output_tsv_file)

Successfully converted haunted_places.csv to haunted_places.tsv
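
Some descriptions contain commas (for example, the one starting "In the 1970's, one room, ..."), so they are quoted in the CSV. The sketch below uses a made-up two-line sample and the standard-library `csv` module, not pandas, to illustrate why such fields survive the delimiter change: a field is only quoted when it contains the active delimiter. The helper name `csv_text_to_tsv_text` and the sample text are assumptions for illustration; pandas' default `to_csv` quoting is expected to behave the same way.

```python
import csv
import io

def csv_text_to_tsv_text(csv_text):
    """Re-serialise CSV text as tab-separated text, keeping every field intact."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

# Made-up sample: the description holds a comma, so it is quoted in the CSV.
sample_csv = 'city,description\nAdrian,"In the 1970\'s, one room was closed"\n'
sample_tsv = csv_text_to_tsv_text(sample_csv)

# The comma is ordinary text once the delimiter is a tab, so no quotes are needed.
print(sample_tsv)
```

A description that itself contained a tab would still be quoted in the TSV, which keeps the file parseable.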


In [2]:
df_1 = _dntk.execute_sql(
  'SELECT *\nFROM \'haunted_places.tsv\'',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled'
)
df_1

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097
...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098


#### Haunted Places Evidence Analyzer
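The script below reads the TSV file of haunted places and derives five new columns from each row's `description`: `audio_evidence` and `visual_evidence` (whether any keyword from the supplied lists appears), `haunted_places_date` (the first date found, or the placeholder 2025/01/01 when there is none), `haunted_places_witness_count` (0 when no count is found), and `time_of_day` ("Unknown" when no time keyword appears). It then writes the result to a new TSV file.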

In [19]:
import pandas as pd
import datefinder
from datetime import datetime
import number_parser

def add_evidence_columns(input_file_path, output_file_path, audio_keywords, visual_keywords):

    try:
        df = pd.read_csv(input_file_path, sep='\t')

        # Create the 'audio_evidence' column
        df['audio_evidence'] = df['description'].apply(
            lambda description: isinstance(description, str) and any(keyword in description.lower() for keyword in audio_keywords)
        )

        # Create the 'visual_evidence' column
        df['visual_evidence'] = df['description'].apply(
            lambda description: isinstance(description, str) and any(keyword in description.lower() for keyword in visual_keywords)
        )

        # Create the 'haunted_places_date' column: the first date datefinder
        # finds, or the placeholder 2025/01/01 when the description has none.
        # find_dates returns a generator, so next() with a default covers both
        # cases in a single pass.
        df['haunted_places_date'] = df['description'].apply(
            lambda description: next(
                datefinder.find_dates(description) if isinstance(description, str) else iter(()),
                datetime(2025, 1, 1)
            ).strftime('%Y/%m/%d')
        )

        # Create the 'haunted_places_witness_count' column
        df['haunted_places_witness_count'] = df['description'].apply(
            lambda description: parse_witness_count(description) if isinstance(description, str) else 0
        )
        
        # Create the 'time_of_day' column
        df['time_of_day'] = df['description'].apply(
            lambda description: discern_time_of_day(description) if isinstance(description, str) else "Unknown"
        )

        # Save the updated DataFrame to a new TSV file
        df.to_csv(output_file_path, sep='\t', index=False)
        print(f"Successfully created {output_file_path} with 'audio_evidence', 'visual_evidence', 'haunted_places_date', 'haunted_places_witness_count', and 'time_of_day' columns.")

    except FileNotFoundError:
        print(f"Error: File not found at {input_file_path}")
    except pd.errors.EmptyDataError:
        print(f"Error: The file at {input_file_path} is empty.")
    except KeyError:
        print("Error: 'description' column not found in the file.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

def parse_witness_count(description):
    """Return the number that follows "seen by" or "witnessed by", or 0 if none is found."""
    try:
        description_lower = description.lower()
        if "witness" in description_lower or "seen by" in description_lower:
            # Split the description into words
            words = description_lower.split()

            for i, word in enumerate(words):
                # Only a number directly after "witnessed by" / "seen by" counts
                if word in ["witnessed", "seen"] and i + 1 < len(words) and words[i + 1] == "by":
                    try:
                        # number_parser.parse returns a string ("three" -> "3"),
                        # so int() raises ValueError when the word is not a number
                        return int(number_parser.parse(words[i + 2]))
                    except (ValueError, IndexError):
                        pass  # Not a number, continue searching
        return 0  # No witness count found
    except Exception:
        return 0

def discern_time_of_day(description):

    description_lower = description.lower()
    if any(keyword in description_lower for keyword in ["morning", "sunrise"]):
        return "Morning"
    elif any(keyword in description_lower for keyword in ["evening", "night", "sunset", "dark"]):
        return "Evening"
    elif "dusk" in description_lower:
        return "Dusk"
    else:
        return "Unknown"

# Example usage:
if __name__ == "__main__":
    input_file = 'haunted_places.tsv'
    output_file = 'haunted_places_evidence.tsv'
    audio_keywords_to_search = ['noises', 'sound', 'voices']  # Example audio keywords
    visual_keywords_to_search = ['camera', 'pictures', "visual"]  # Example visual keywords
    add_evidence_columns(input_file, output_file, audio_keywords_to_search, visual_keywords_to_search)

Successfully created haunted_places_evidence.tsv with 'audio_evidence', 'visual_evidence', 'haunted_places_date', 'haunted_places_witness_count', and 'time_of_day' columns.
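
The evidence flags use a plain substring test (`keyword in description.lower()`), so a keyword also matches inside longer words. The sketch below contrasts that test with a whole-word variant built on `re`. The function names and sample sentences are hypothetical and only illustrate the trade-off; this is not the method used to produce the output file.

```python
import re

def has_keyword_substring(description, keywords):
    """Substring test, mirroring the check used for the evidence columns."""
    return isinstance(description, str) and any(k in description.lower() for k in keywords)

def has_keyword_word(description, keywords):
    """Whole-word test: a keyword only counts when it stands alone as a word."""
    if not isinstance(description, str):
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, description.lower()) is not None

audio_keywords = ['noises', 'sound', 'voices']

# "soundly" contains "sound", so only the substring test flags it.
print(has_keyword_substring("Guests sleep soundly here.", audio_keywords))  # True
print(has_keyword_word("Guests sleep soundly here.", audio_keywords))       # False
print(has_keyword_word("A strange sound is heard.", audio_keywords))        # True
```

The whole-word variant is stricter in both directions: it also skips "sounds", so plural forms would have to be added to the keyword list explicitly.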


In [21]:
df_2 = _dntk.execute_sql(
  'SELECT *\nFROM \'haunted_places_evidence.tsv\'',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled'
)
df_2

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude,audio_evidence,visual_evidence,haunted_places_date,haunted_places_witness_count,time_of_day
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727,False,False,2025-02-20,0,Evening
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434,False,False,2025-02-13,0,Unknown
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547,False,False,2025-01-01,0,Evening
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547,False,False,1970-02-13,0,Morning
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097,False,False,2025-01-01,0,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653,False,False,2025-12-13,0,Evening
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653,False,False,2025-01-01,0,Unknown
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098,True,True,2007-02-13,0,Unknown
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098,False,False,2025-01-01,0,Morning
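
Every row shown above has a `haunted_places_witness_count` of 0, because `parse_witness_count` only looks for a number right after "seen by" or "witnessed by". As a rough illustration of that rule, here is a standard-library sketch that does the same lookup with one regular expression. `NUMBER_WORDS`, `WITNESS_PATTERN`, and `sketch_witness_count` are hypothetical names, and the small word table stands in for `number_parser`, which recognises far more number words.

```python
import re

# Hypothetical lookup table; number_parser covers many more words than this.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Capture the single token that follows "seen by" or "witnessed by".
WITNESS_PATTERN = re.compile(r"\b(?:seen|witnessed)\s+by\s+([a-z]+|\d+)\b")

def sketch_witness_count(description):
    """Return the count after "seen by"/"witnessed by", or 0 when there is none."""
    if not isinstance(description, str):
        return 0
    match = WITNESS_PATTERN.search(description.lower())
    if match is None:
        return 0
    token = match.group(1)
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token, 0)

print(sketch_witness_count("The apparition was seen by three students."))  # 3
print(sketch_witness_count("Witnessed by 12 people in 1998."))             # 12
print(sketch_witness_count("A ghost walks the halls at night."))           # 0
```

Descriptions that phrase the count differently ("several people saw", "a group of students") fall through to 0 under either approach, which helps explain why so few rows get a non-zero count.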


In [1]:
df_3 = _dntk.execute_sql(
  'SELECT \n    COUNT(CASE WHEN audio_evidence = TRUE THEN 1 END) AS audio_true_count,\n    COUNT(CASE WHEN audio_evidence = FALSE THEN 1 END) AS audio_false_count,\n    COUNT(CASE WHEN visual_evidence = TRUE THEN 1 END) AS visual_true_count,\n    COUNT(CASE WHEN visual_evidence = FALSE THEN 1 END) AS visual_false_count,\n    COUNT(CASE WHEN haunted_places_witness_count > 0 THEN 1 END) AS num_of_lines_with_more_then_0\nFROM haunted_places_evidence.tsv;',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled'
)
df_3

Unnamed: 0,audio_true_count,audio_false_count,visual_true_count,visual_false_count,num_of_lines_with_more_then_0
0,2096,8896,273,10719,7
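
A quick consistency check on these counts: `COUNT(CASE WHEN ... THEN 1 END)` skips rows where the condition is not met, so if neither flag column contains NULLs, the true and false counts for each flag should add up to the same row total. The sketch below copies the numbers from the result above and derives the shares; the helper `share` is a hypothetical name used only here.

```python
# Counts copied from the query result above.
audio_true, audio_false = 2096, 8896
visual_true, visual_false = 273, 10719
rows_with_witnesses = 7

total_rows = audio_true + audio_false
# Both flags should cover every row, so the two sums must agree.
assert total_rows == visual_true + visual_false

def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(total_rows)                              # 10992
print(share(audio_true, total_rows))           # 19.1
print(share(visual_true, total_rows))          # 2.5
print(share(rows_with_witnesses, total_rows))  # 0.1
```

So roughly one description in five mentions an audio keyword, about one in forty mentions a visual keyword, and a witness count is extracted for only 7 of the 10,992 rows.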

