# The `DataFrame`

The DataFrame data structure is the primary object that you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional series object: it has an index and multiple columns of content, and each column has a label. In fact, the distinction between a column and a row is only conceptual, so you can think of the DataFrame itself as simply a two-axis labeled array.
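To make the "two-axis labeled array" idea concrete, here is a minimal sketch in plain Haskell using `Data.Map` from the `containers` package. It is a toy model only, not how the DataFrame library stores its data. The names `Frame`, `grades`, `cell`, and `transposeFrame` are hypothetical, and the grid of scores is made up for illustration.

```haskell
import qualified Data.Map.Strict as Map

-- Toy model (an assumption for illustration, not the library's representation):
-- a frame maps column labels to columns, and each column maps row labels to values.
type Column = Map.Map String Int
type Frame  = Map.Map String Column

-- Made-up scores: columns are classes, rows are students.
grades :: Frame
grades = Map.fromList
  [ ("Physics",   Map.fromList [("Alice", 85), ("Jack", 78)])
  , ("Chemistry", Map.fromList [("Alice", 91), ("Jack", 82)])
  ]

-- Look up one cell by its column label, then its row label.
cell :: String -> String -> Frame -> Maybe Int
cell colLabel rowLabel frame = Map.lookup colLabel frame >>= Map.lookup rowLabel

-- Swap the two axes: row labels become column labels and vice versa.
transposeFrame :: Frame -> Frame
transposeFrame frame = Map.fromListWith Map.union
  [ (rowLabel, Map.singleton colLabel value)
  | (colLabel, column) <- Map.toList frame
  , (rowLabel, value)  <- Map.toList column
  ]
```

Evaluating `cell "Physics" "Alice" grades` and `cell "Alice" "Physics" (transposeFrame grades)` both give `Just 85`. The same cell is reachable whichever axis you call "rows", which is the sense in which the row/column distinction is only conceptual.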

In [2]:
-- Let's start by importing the DataFrame library
import qualified DataFrame as D

In [4]:
-- I'm going to jump in with an example. Let's create three school records for students and their
-- class grades. I'll create each as a list holding the school, the student name, the class name,
-- and the score.

record1 = [D.toAny "school1", D.toAny "Alice", D.toAny "Physics", D.toAny 85]
record2 = [D.toAny "school2", D.toAny "Jack", D.toAny "Chemistry", D.toAny 82]
record3 = [D.toAny "school3", D.toAny "Helen", D.toAny "Biology", D.toAny 90]

In [5]:
-- We can create a dataframe from our list of records.
df = D.fromRows ["School", "Name", "Class", "Score"] [record1, record2, record3]

-- Now let's look to verify that our data is as expected.
df

----------------------------------------------------------------------
School<br>[Char] | Name<br>[Char] | Class<br>[Char] | Score<br>Integer
-----------------|----------------|-----------------|-----------------
school1          | Alice          | Physics         | 85              
school2          | Jack           | Chemistry       | 82              
school3          | Helen          | Biology         | 90              


In [None]:
-- You'll notice here that Jupyter creates a nice bit of HTML to render the results of the
-- dataframe.

In [6]:
-- We can also define a dataframe as a list of named columns (this one leaves out the Class column).
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

-- Tells the compiler that unannotated numeric literals should default to Int (not Integer)
-- and unannotated string literals should default to Text (not String).
default (Int, Text) 

df = D.fromNamedColumns [ ("School", D.fromList ["school1", "school2", "school3"])
                        , ("Name", D.fromList ["Alice", "Jack", "Helen"])
                        , ("Score", D.fromList [85,82,90])]

df

--------------------------------------------
School<br>Text | Name<br>Text | Score<br>Int
---------------|--------------|-------------
school1        | Alice        | 85          
school2        | Jack         | 82          
school3        | Helen        | 90          


In [8]:
-- If we wanted to select the data associated with school2, we would just query the dataframe
-- with the `filter` function.
import qualified DataFrame.Functions as F

school = F.col @Text "School"

D.filter school (== "school2") df

--------------------------------------------
School<br>Text | Name<br>Text | Score<br>Int
---------------|--------------|-------------
school2        | Jack         | 82          


In [None]:
-- Adding a new column to the DataFrame is as easy as inserting it under a new name.
-- For instance, if we wanted to add a class ranking column with a default
-- value of 0, we could do so with the insertVectorWithDefault function, passing it an empty vector.
-- This broadcasts the default value to the new column immediately.
import qualified Data.Vector as V

D.insertVectorWithDefault 0 "Class Ranking" V.empty df

In [11]:
:t D.insertVectorWithDefault