# Uploading and Inspecting Data

The purpose of this notebook is to load data stored in different formats, convert it into a pandas DataFrame, and then do some basic inspection of it before moving on to work with it at a more granular level.

The first thing we need to do is actually access the data (stored in `~/data`). Use the cell below to select the dataset that you'd like to work with! All the data is the same, but each dataset is in a different format, so you can practice working with whichever format suits you.

In [17]:
import pandas as pd

# This tutorial uses the CSV data by default, but you may wish to use a different format. Just uncomment the line for the format you'd like to work with!
file_path = '../data/data.csv'
# file_path = '../data/data.json'
# file_path = '../data/data.md'
# file_path = '../data/data.txt'
# file_path = '../data/data.xml'

## Having a first look

`pandas` provides a number of easy-to-use functions to convert your data into a DataFrame. Read more about [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) if you are not familiar with this terminology.

Put simply, a DataFrame is like a table created in memory that you can work with similarly to those you'd find in a database, with certain small but important differences that we'll explore later.
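
To make the "table in memory" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand and inspects it. The name `sample` is just an illustration, and the three rows are copied from the dataset output shown in this notebook; the real data is loaded from file instead.

```python
import pandas as pd

# A hand-built DataFrame with the same three columns as our dataset.
# `sample` is a hypothetical name used only for this illustration.
sample = pd.DataFrame(
    {
        "search_query": [
            "vacation spots recipe",
            "what is invest in crypto",
            "AI tools recipe",
        ],
        "time": [1722760240, 1734252942, 1728010297],
        "platform": ["mobile", "mobile", "desktop"],
    }
)

print(sample.shape)    # (rows, columns) -> (3, 3)
print(sample.dtypes)   # the data type pandas inferred for each column
print(sample.head(2))  # the first two rows
```

`shape`, `dtypes` and `head()` are usually the quickest way to check that a load did what you expected.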

In [25]:
# we'll use the read_{file_type} function to convert our raw data into a DataFrame. Again, uncomment the line which applies to the data format that you'd like to work with.
df = pd.read_csv(file_path)
# df = pd.read_json(file_path)
# df = pd.read_xml(file_path)

print(df)

                        search_query        time platform
0              vacation spots recipe  1722760240   mobile
1           what is invest in crypto  1734252942   mobile
2                    AI tools recipe  1728010297  desktop
3               best camera symptoms  1730697978   mobile
4            best meditation near me  1735175397   mobile
...                              ...         ...      ...
9995     best vacation spots near me  1724048159   mobile
9996                best flu near me  1712704022  desktop
9997                can I cook pasta  1735612619  desktop
9998     best get a mortgage near me  1723209855  desktop
9999  best electric vehicles near me  1724506448   mobile

[10000 rows x 3 columns]
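
If you'd rather not comment and uncomment lines, one option is to pick the reader from the file extension. The sketch below is only an illustration: `READERS` and `load_dataset` are hypothetical names, not part of pandas, and it assumes a pandas version recent enough to include `pd.read_xml`.

```python
from pathlib import Path

import pandas as pd

# Map each supported extension to the pandas reader that handles it.
# (Hypothetical helper table; extend it as needed.)
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xml": pd.read_xml,
}


def load_dataset(path):
    """Load a file into a DataFrame, choosing the reader from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"No pandas reader registered for '{suffix}' files")
    return READERS[suffix](path)
```

With this in place, `df = load_dataset(file_path)` would replace the commented-out lines above. Formats without a dedicated reader raise a `ValueError` rather than failing silently.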


### A note about `.txt` and `.md` formats

There is no pandas-native way to directly load files in Markdown (`.md`) or plaintext (`.txt`), but this is a really easy fix. Use the cells below if you'd like to work with those file types (otherwise, feel free to skip).

In [16]:
import pandas as pd

# Load a tab-separated .txt file into a DataFrame.
# file_path must point at '../data/data.txt' here (uncomment that line in the first cell).
df = pd.read_csv(file_path, sep="\t")

print(df)

                        search_query        time platform
0              vacation spots recipe  1722760240   mobile
1           what is invest in crypto  1734252942   mobile
2                    AI tools recipe  1728010297  desktop
3               best camera symptoms  1730697978   mobile
4            best meditation near me  1735175397   mobile
...                              ...         ...      ...
9995     best vacation spots near me  1724048159   mobile
9996                best flu near me  1712704022  desktop
9997                can I cook pasta  1735612619  desktop
9998     best get a mortgage near me  1723209855  desktop
9999  best electric vehicles near me  1724506448   mobile

[10000 rows x 3 columns]
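
Since every file is meant to hold the same data, a quick sanity check is to load two formats and compare the results. The sketch below does this on a two-row sample held in memory (rows copied from the output above), so it runs without the data files. The variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# The same two rows, once comma-separated and once tab-separated.
csv_text = (
    "search_query,time,platform\n"
    "AI tools recipe,1728010297,desktop\n"
    "can I cook pasta,1735612619,desktop\n"
)
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(StringIO(csv_text))
from_tsv = pd.read_csv(StringIO(tsv_text), sep="\t")

# DataFrame.equals compares shape, column labels, dtypes and values.
print(from_csv.equals(from_tsv))  # True
```

The only difference between the two calls is the `sep` argument, which tells `read_csv` which character separates the columns.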


In [24]:
# We can do the same thing with Markdown files, but we'll need a few more steps to do so

import pandas as pd
from io import StringIO

with open("../data/data.md", "r") as f:
    md_text = f.read()

# Clean the Markdown to get just the table
lines = md_text.strip().splitlines()
clean_lines = [
    line for line in lines 
    if not set(line.strip()).issubset(set("|:- "))  # this filters out rows like |:------|----:|
]

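# Convert each Markdown row into a CSV row: drop the outer pipes, split on the
# inner ones, and strip the padding so column names and values carry no stray spaces.
# (Assumes no cell contains a comma or a pipe.)
csv_text = "\n".join(
    ",".join(cell.strip() for cell in line.strip()[1:-1].split("|"))
    for line in clean_lines
)

# Wrap the text in a file-like object so read_csv can parse it
data = StringIO(csv_text)

# Create the DataFrame
df = pd.read_csv(data)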

# Show the first few rows
df.head()


Unnamed: 0,search_query,time,platform
0,vacation spots recipe,1722760240,mobile
1,what is invest in crypto,1734252942,mobile
2,AI tools recipe,1728010297,desktop
3,best camera symptoms,1730697978,mobile
4,best meditation near me,1735175397,mobile
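
The `Unnamed: 0` column in the output above suggests that the Markdown table carries the original row index in its first, unlabelled column. That is an assumption based on the output, not something the file itself documents. If so, passing `index_col=0` to `read_csv` turns that column back into the index. Here is a minimal sketch on a small inline table; the layout of `md_text` is assumed, while the rows are copied from the output above.

```python
from io import StringIO

import pandas as pd

# A small inline Markdown table (layout assumed for illustration).
md_text = """
|    | search_query             |       time | platform |
|---:|:-------------------------|-----------:|:---------|
|  0 | vacation spots recipe    | 1722760240 | mobile   |
|  1 | what is invest in crypto | 1734252942 | mobile   |
"""

# Keep header and data rows, skip the |---:|:---| separator row,
# and strip the padding around every cell.
rows = [
    [cell.strip() for cell in line.strip()[1:-1].split("|")]
    for line in md_text.strip().splitlines()
    if not set(line.strip()) <= set("|:- ")
]
csv_text = "\n".join(",".join(row) for row in rows)

# index_col=0 uses the first (unlabelled) column as the row index
df_md = pd.read_csv(StringIO(csv_text), index_col=0)

print(df_md.columns.tolist())  # ['search_query', 'time', 'platform']
```

This leaves the DataFrame with the same three columns as the CSV, JSON and XML routes.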


## Conclusions

Okay, great! We've loaded our dataset from whichever format we chose, and now we can work with it in the [next notebook](../basic-data-interrogation/notebook.ipynb)!