# pandas

# pandas

-   pandas is compiled of data structure and data manipulation tools to
    assist with data cleaning and analysis

-   pandas is typically used in addition to NumPy, SciPy, and matplotlib

pandas has two data structures: Series and DataFrame

## Series

-   Series are one dimensional array-like objects that contain values of
    the same type and has data labels called its index

    ``` {r}
    library(reticulate)
    ```

In [1]:
import pandas as pd
from pandas import Series, DataFrame

ser = pd.Series([4, 7, -5, 3, "NA"])

print(ser)

0     4
1     7
2    -5
3     3
4    NA
dtype: object

## DataFrame

-   A dataframe is a rectangular table of data that contains ordered
    named columns

-   Each column can be a different data type ( numeric, string, Boolean,
    etc.)

-   DataFrames have both row and column index

In [2]:
data = {"car": ["Volvo", "Volkswagen", "Dodge", "GMC", "Chevy", "Mazda"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "cylinders": [1, 1, 3, 2, 2, 3]}
frame = pd.DataFrame(data)

print(frame)

#head is used to display the first 5 lines
print(frame.head)

#tail is used to display the last 5 lines of the data frame

print(frame.tail)

#You can arrange the column order by specifying the column names

frame2 = pd.DataFrame(data, columns=["car", "year", "cylinders"])

print(frame2)

#you can add an empty column, but it will appear with missing values

frame3 = pd.DataFrame(data, columns=["car", "year", "cylinders", "accidents"])

print(frame3)

#you can retrieve a column by using a [] notation or a dot/. notation

print(frame3["year"])
print(frame3.year)

#rows can be retrieved by position or name with the iloc and loc attributes 

print(frame3.loc[1])

print(frame3.loc[2])

#Columns can also be modified by assigning values 
print(frame3)

frame3["accidents"]= 9

print(frame3)

          car  year  cylinders
0       Volvo  2000          1
1  Volkswagen  2001          1
2       Dodge  2002          3
3         GMC  2001          2
4       Chevy  2002          2
5       Mazda  2003          3
<bound method NDFrame.head of           car  year  cylinders
0       Volvo  2000          1
1  Volkswagen  2001          1
2       Dodge  2002          3
3         GMC  2001          2
4       Chevy  2002          2
5       Mazda  2003          3>
<bound method NDFrame.tail of           car  year  cylinders
0       Volvo  2000          1
1  Volkswagen  2001          1
2       Dodge  2002          3
3         GMC  2001          2
4       Chevy  2002          2
5       Mazda  2003          3>
          car  year  cylinders
0       Volvo  2000          1
1  Volkswagen  2001          1
2       Dodge  2002          3
3         GMC  2001          2
4       Chevy  2002          2
5       Mazda  2003          3
          car  year  cylinders accidents
0       Volvo  2000          

-   Let’s try to use pandas with our AOU Rural Survey csv file

    ``` python
    pd.read_csv("AOURuralSurvey.csv")

    csv_1 = pd.read_csv("AOURuralSurvey.csv")

    print(csv_1.head())

    #get column names

    print(csv_1.columns)

    #check to see is columns are null

    print(csv_1.isnull())

    #a nicer way to see this is to sum the null 

    print(csv_1.isnull().sum())

    #we can also drop the rows containing NA  values from our dataframe 

    csv_2 = csv_1.dropna() 

    print(csv_2)

    #OOPS we dropped every row in our data frame! Let's try again.

    #This time we will only drop the columns containing NA

    csv_3 = csv_1.dropna(1)
    #Note pandas will be updating soon and the column name will replace the number 1

    #we can also fill the na values in our dataframes

    csv_4 =  csv_1.fillna({"survey_version_name":"version2.2"})

    print(csv_4)
    ```

           person_id          survey_datatime                             survey  \
        0          1  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        1          2  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        2          3  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        3          4  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        4          5  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   

           question_concept_id                          question  answer_concept_id  \
        0             43530268  Delayed Medical Care: Rural Area           43529416   
        1             43530268  Delayed Medical Care: Rural Area           43530110   
        2             43530268  Delayed Medical Care: Rural Area           43529416   
        3             43530268  Delayed Medical Care: Rural Area           43530110   
        4             43530268  Delayed Medical Care: Rural Area           43529416   

                                answer  survey_version_concept_id  survey_version_name  
        0  Delayed Care Rural Area: No                        NaN                  NaN  
        1  Delayed Care Rural Area:Yes                        NaN                  NaN  
        2  Delayed Care Rural Area: No                        NaN                  NaN  
        3  Delayed Care Rural Area:Yes                        NaN                  NaN  
        4  Delayed Care Rural Area: No                        NaN                  NaN  
        Index(['person_id', 'survey_datatime', 'survey', 'question_concept_id',
               'question', 'answer_concept_id', 'answer', 'survey_version_concept_id',
               'survey_version_name'],
              dtype='object')
            person_id  survey_datatime  survey  question_concept_id  question  \
        0       False            False   False                False     False   
        1       False            False   False                False     False   
        2       False            False   False                False     False   
        3       False            False   False                False     False   
        4       False            False   False                False     False   
        5       False            False   False                False     False   
        6       False            False   False                False     False   
        7       False            False   False                False     False   
        8       False            False   False                False     False   
        9       False            False   False                False     False   
        10      False            False   False                False     False   
        11      False            False   False                False     False   
        12      False            False   False                False     False   
        13      False            False   False                False     False   
        14      False            False   False                False     False   
        15      False            False   False                False     False   
        16      False            False   False                False     False   
        17      False            False   False                False     False   
        18      False            False   False                False     False   
        19      False            False   False                False     False   
        20      False            False   False                False     False   
        21      False            False   False                False     False   

            answer_concept_id  answer  survey_version_concept_id  survey_version_name  
        0               False   False                       True                 True  
        1               False   False                       True                 True  
        2               False   False                       True                 True  
        3               False   False                       True                 True  
        4               False   False                       True                 True  
        5               False   False                       True                 True  
        6               False   False                       True                 True  
        7               False   False                       True                 True  
        8               False   False                       True                 True  
        9               False   False                       True                 True  
        10              False   False                       True                 True  
        11              False   False                       True                 True  
        12              False   False                       True                 True  
        13              False   False                       True                 True  
        14              False   False                       True                 True  
        15              False   False                       True                 True  
        16              False   False                       True                 True  
        17              False   False                       True                 True  
        18              False   False                       True                 True  
        19              False   False                       True                 True  
        20              False   False                       True                 True  
        21              False   False                       True                 True  
        person_id                     0
        survey_datatime               0
        survey                        0
        question_concept_id           0
        question                      0
        answer_concept_id             0
        answer                        0
        survey_version_concept_id    22
        survey_version_name          22
        dtype: int64
        Empty DataFrame
        Columns: [person_id, survey_datatime, survey, question_concept_id, question, answer_concept_id, answer, survey_version_concept_id, survey_version_name]
        Index: []
            person_id          survey_datatime                             survey  \
        0           1  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        1           2  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        2           3  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        3           4  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        4           5  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        5           6  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        6           7  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        7           8  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        8           9  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        9          10  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        10         11  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        11         12  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        12         13  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        13         14  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        14         15  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        15         16  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        16         17  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        17         18  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        18         19  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        19         20  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        20         21  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   
        21         22  2019-09-06 18:56:49 UTC  Healthcare Access and Utilization   

            question_concept_id                          question  answer_concept_id  \
        0              43530268  Delayed Medical Care: Rural Area           43529416   
        1              43530268  Delayed Medical Care: Rural Area           43530110   
        2              43530268  Delayed Medical Care: Rural Area           43529416   
        3              43530268  Delayed Medical Care: Rural Area           43530110   
        4              43530268  Delayed Medical Care: Rural Area           43529416   
        5              43530268  Delayed Medical Care: Rural Area           43530110   
        6              43530268  Delayed Medical Care: Rural Area           43529416   
        7              43530268  Delayed Medical Care: Rural Area           43530110   
        8              43530268  Delayed Medical Care: Rural Area           43529416   
        9              43530268  Delayed Medical Care: Rural Area           43530110   
        10             43530268  Delayed Medical Care: Rural Area           43529416   
        11             43530268  Delayed Medical Care: Rural Area           43530110   
        12             43530268  Delayed Medical Care: Rural Area           43529416   
        13             43530268  Delayed Medical Care: Rural Area           43530110   
        14             43530268  Delayed Medical Care: Rural Area           43529416   
        15             43530268  Delayed Medical Care: Rural Area           43530110   
        16             43530268  Delayed Medical Care: Rural Area           43529416   
        17             43530268  Delayed Medical Care: Rural Area           43530110   
        18             43530268  Delayed Medical Care: Rural Area           43529416   
        19             43530268  Delayed Medical Care: Rural Area           43530110   
        20             43530268  Delayed Medical Care: Rural Area           43529416   
        21             43530268  Delayed Medical Care: Rural Area           43530110   

                                 answer  survey_version_concept_id survey_version_name  
        0   Delayed Care Rural Area: No                        NaN          version2.2  
        1   Delayed Care Rural Area:Yes                        NaN          version2.2  
        2   Delayed Care Rural Area: No                        NaN          version2.2  
        3   Delayed Care Rural Area:Yes                        NaN          version2.2  
        4   Delayed Care Rural Area: No                        NaN          version2.2  
        5   Delayed Care Rural Area:Yes                        NaN          version2.2  
        6   Delayed Care Rural Area: No                        NaN          version2.2  
        7   Delayed Care Rural Area:Yes                        NaN          version2.2  
        8   Delayed Care Rural Area: No                        NaN          version2.2  
        9   Delayed Care Rural Area:Yes                        NaN          version2.2  
        10  Delayed Care Rural Area: No                        NaN          version2.2  
        11  Delayed Care Rural Area:Yes                        NaN          version2.2  
        12  Delayed Care Rural Area: No                        NaN          version2.2  
        13  Delayed Care Rural Area:Yes                        NaN          version2.2  
        14  Delayed Care Rural Area: No                        NaN          version2.2  
        15  Delayed Care Rural Area:Yes                        NaN          version2.2  
        16  Delayed Care Rural Area: No                        NaN          version2.2  
        17  Delayed Care Rural Area:Yes                        NaN          version2.2  
        18  Delayed Care Rural Area: No                        NaN          version2.2  
        19  Delayed Care Rural Area:Yes                        NaN          version2.2  
        20  Delayed Care Rural Area: No                        NaN          version2.2  
        21  Delayed Care Rural Area:Yes                        NaN          version2.2  

        /var/folders/0c/gwy9zj6922qgn3wm8_zy2gy80000gp/T/ipykernel_75866/230286523.py:29: FutureWarning:

        In a future version of pandas all arguments of DataFrame.dropna will be keyword-only.