# Pandas via Jupyter for manipulating basic tabular data

## Part 1: Getting used to Jupyter and reading in CSV data

Welcome to **Jupyter Notebooks**! You probably installed Jupyter via **Anaconda**. Anaconda is a suite of data science resources: one of the apps it installs is Jupyter (another especially useful one is VS Code). Jupyter is a workbook-style approach to implementing **Python** code. Python relies upon **libraries**, which are collections of functions relating to a particular task. **Pandas** is the most popular library for working with tabular data, and it can also approximate the functionality of a **relational database**.

That is a mouthful, but the point is that we are going to use Pandas (a Python library) to manipulate and visualize tabular data, usually through the Jupyter interface. Pandas is hardly the only tool we could use to implement a relational database. Microsoft Access, FileMaker, and SQL (the last of which is a language, not a particular app) are all relational database tools as well. Pandas is free, open source, and (in my subjective opinion) easier to learn than the others.

Jupyter Notebooks operate in "cells". Each cell can be executed individually or sequentially to implement Python code. In addition to implementing Python code, a cell (such as this one) can also render Markdown formatting when you wish to focus on the text.

In [1]:
# The pound sign (#) designates a comment (i.e., not computationally active) within a cell that otherwise executes code.
# Let's try a very basic Python function, which is simply to print text:

print("this is a bit of Python text")

# Pretty simple, but note that in any kind of coding the syntax really matters
# (e.g., that the text you wish to display is in quotation marks).

this is a bit of Python text


"Vanilla" Python can do all sorts of useful things: automate tasks using loops, perform basic mathematics, etc. It can even do some basic stuff with tabular data. But most data scientists move straight to Pandas for a broader range of functionality. So the first thing we need to do is "summon" Pandas. This next step unlocks many of the commands we will use going forward.

In [3]:
# Importing the library (it could take a moment to load)
import pandas as pd
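
For comparison, here is a minimal sketch of the sort of thing "vanilla" Python can do with tabular data without Pandas: a loop, some basic arithmetic, and a simple lookup. The three rows are copied from the table displayed later in this notebook; the variable names (`rows`, `total`, `largest`) are only illustrative.

```python
# A few rows copied from the table displayed later in this notebook.
rows = [
    {"Component": "LaSK", "Personnel (Peak)": 105000},
    {"Component": "VM", "Personnel (Peak)": 27000},
    {"Component": "LSK", "Personnel (Peak)": 44000},
]

# A loop plus basic arithmetic: add up the personnel figures.
total = 0
for row in rows:
    total += row["Personnel (Peak)"]

# Find the row with the largest personnel figure.
largest = max(rows, key=lambda row: row["Personnel (Peak)"])

print(total)                 # 176000
print(largest["Component"])  # LaSK
```

This works, but every new question means writing another loop by hand, which is the kind of repetition Pandas is designed to spare you.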

As historians, we differentiate between our *primary source*, which we try to keep in its original state, and our notes / transcription / markup of that text. In other words, if you want to work with a Byzantine manuscript, for instance, you could highlight parts of the PDF, or transcribe it into a text file, but you would never write on the original document. Data science is no different: usually we *read in* data from a file (a CSV file, for instance), but once it is read in, you are working with a copy of the original document. You can modify the original document if you wish, but you don't have to, and would rarely want to do so.


Let's start by doing something very simple: reading in a CSV file, saving it as a local variable, and then looking at it.

In order to "read in" a file as a local variable, however, we need to know where it "lives," which brings us to the idea of a **path**. The files on your computer are organized hierarchically, like a tree. The "path" is the sequence of steps that you have to "walk" in order to go from the **root** to the file you are looking for.
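
To make the "walk" from root to file concrete, the sketch below splits a path into its individual steps using Python's built-in `pathlib` module. The path itself is hypothetical and does not need to exist on your computer; `PurePosixPath` simply interprets the text using forward slashes (the macOS/Linux style).

```python
from pathlib import PurePosixPath

# Hypothetical path, for illustration only; substitute the location of your own file.
example_path = PurePosixPath("/Users/example_user/Documents/example_data.csv")

# Each "step" from the root ("/") down to the file.
print(example_path.parts)
# ('/', 'Users', 'example_user', 'Documents', 'example_data.csv')

# The folder that contains the file.
print(example_path.parent)  # /Users/example_user/Documents

# The file name itself, and its extension.
print(example_path.name)    # example_data.csv
print(example_path.suffix)  # .csv
```

Windows paths follow the same logic but typically begin with a drive letter and use backslashes.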


In [4]:
# Read in the dataframe
df = pd.read_csv("/Users/PICKETTJ/Dropbox/Active_Directories/Teaching/Graduate_Seminars/Historical_Methods_spring2025/Class_Exercises_Methods/initial_student_tabular_data/chrvalaandrew_Military_Strength_DDR.csv")


Some notes about this first, basic step:
    
The logic here is that the variable name (here `df` for 'dataframe', but you can call it anything you like) now designates a local copy of the CSV file found at the path given in quotation marks. That copy is stored in a dataframe (a tabular data type specific to Pandas).

Therefore, if you modify dataframe `df` in any future code, it will not alter the information in your CSV file (unless you explicitly tell it to). It also means that if you rename or move the source file, the code will deliver an error message if you try to run it again.  
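
Both points can be checked with a small experiment. The sketch below writes a throwaway CSV file to a temporary folder (the file name and contents are hypothetical), changes a value in the dataframe, and then confirms that the file on disk is untouched. It then renames the file to show the error you get when the old path is used again.

```python
import os
import tempfile

import pandas as pd

# Create a small throwaway CSV file so the example does not touch any real data.
# The file name and contents are hypothetical.
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "example_data.csv")
with open(csv_path, "w") as f:
    f.write("Component,HQ\nVM,Rostock\nLaSK,Geltow\n")

# Read the file into a dataframe, then change a value in the dataframe only.
df_example = pd.read_csv(csv_path)
df_example.loc[0, "HQ"] = "CHANGED"

# Reading the file again shows that the CSV on disk still has the original value.
df_reread = pd.read_csv(csv_path)
print(df_example.loc[0, "HQ"])  # CHANGED
print(df_reread.loc[0, "HQ"])   # Rostock

# If the file is moved or renamed, the old path no longer works.
os.rename(csv_path, os.path.join(folder, "renamed.csv"))
try:
    pd.read_csv(csv_path)
except FileNotFoundError as error:
    print("File not found:", error)
```

Writing changes back to disk requires an explicit step such as `df_example.to_csv(...)`, which is what "explicitly tell it to" refers to.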
    
    
Now let's check to make sure it read in properly:

In [7]:
# display the dataframe information simply by inputting the variable name and executing the code
df

|   | Component | HQ | Personnel (Peak) |
|---|---|---|---|
| 0 | NVA | Strausberg | 223000 |
| 1 | LaSK | Geltow | 105000 |
| 2 | VM | Rostock | 27000 |
| 3 | LSK | Strausberg | 44000 |
| 4 | Grenztruppen | Paetz | 47000 |
| 5 | MFSS | Lichtenberg | 91000 |
| 6 | Vopo | Berlin | 275000 |
| 7 | KDA | Berlin | 210000 |
| 8 |  |  | 799000 |


In [10]:
# get some basic descriptive information about the dataframe
df.describe()

|   | Component | HQ | Personnel (Peak) |
|---|---|---|---|
| count | 8 | 8 | 9 |
| unique | 8 | 6 | 9 |
| top | NVA | Strausberg | 223000 |
| freq | 1 | 2 | 1 |
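
The rows `unique`, `top`, and `freq` are what `describe()` reports for text columns; numeric columns get statistics such as the mean and maximum instead. The sketch below rebuilds the table by hand to show the difference. Storing the personnel figures as text is an assumption, inferred from the shape of the output above rather than from the CSV file itself, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the table shown above by hand. Storing the personnel figures as text
# is an assumption, inferred from the unique/top/freq rows in the output.
df_sketch = pd.DataFrame({
    "Component": ["NVA", "LaSK", "VM", "LSK", "Grenztruppen",
                  "MFSS", "Vopo", "KDA", None],
    "HQ": ["Strausberg", "Geltow", "Rostock", "Strausberg", "Paetz",
           "Lichtenberg", "Berlin", "Berlin", None],
    "Personnel (Peak)": ["223000", "105000", "27000", "44000", "47000",
                         "91000", "275000", "210000", "799000"],
})

# With text-only columns, describe() reports count, unique, top, and freq.
# Missing values (the last row) are left out of the count.
text_summary = df_sketch.describe()
print(text_summary)

# After converting the column to numbers, describe() reports numeric statistics
# (count, mean, std, min, quartiles, max).
personnel = pd.to_numeric(df_sketch["Personnel (Peak)"])
numeric_summary = personnel.describe()
print(numeric_summary)
```

If you are unsure how Pandas has interpreted each column, `df.dtypes` lists the data type of every column.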
