# Exploratory Data Analysis

Look at the structure and characteristics of the dataset by using functions such as:

* [df.head( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) or [df.tail( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html)

* [df.sample( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)

* [df.describe( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

* [df.info( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)

* [df[col_name].unique( )](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) and [df[col_name].nunique( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html)

* [df[col_name].value_counts( )](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
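The functions listed above can be tried on a small frame first. The following is a minimal sketch that assumes a hypothetical toy DataFrame (`toy`, with made-up `city` and `tmax` columns), not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "city": ["Portland", "Bangor", "Portland", "Augusta"],
    "tmax": [31, 28, 35, 30],
})

print(toy.head(2))                     # first two rows
print(toy.tail(1))                     # last row
print(toy.sample(2, random_state=0))   # two random rows, reproducible
print(toy.describe())                  # summary statistics for numeric columns
toy.info()                             # dtypes and non-null counts (prints directly)

unique_cities = toy["city"].unique()      # distinct values, in order of appearance
n_cities = toy["city"].nunique()          # number of distinct values
city_counts = toy["city"].value_counts()  # number of rows per distinct value
print(unique_cities, n_cities)
print(city_counts)
```

`describe()` and `info()` give a quick overview of the whole frame, while `unique()`, `nunique()`, and `value_counts()` are most informative on one column at a time.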



# Import libraries, load the dataframe, explore the data

Write your findings below:


In [2]:
# import libraries, load dataframe, explore the data

import pandas as pd
import numpy as np
df = pd.read_csv('P_WX2013b.csv',
                 dtype={'DATE': object, 'PRCP': object,
                        'SNWD': object, 'SNOW': object,
                        'TMAX': object, 'TMIN': object})

print(df.head(3))

             STATION      DATE   PRCP   SNWD   SNOW   TMAX  TMIN
0               ----      ----   ----   ----   ----   ----   ---
1  GHCND:USW00014764  20130101      0    254      0      0  -117
2  GHCND:USW00014764  20130102      0    254      0    -44  -161


# Rename Columns

Using the [rename( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function on the dataframe, pass in a dictionary that maps the old column name to the new column name

`df.rename(columns={old_col_A:new_col_A, old_col_B:new_col_B, ...})`

---

Add the `inplace=True` parameter or make a copy of the dataframe and save the change to this copy.
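As a minimal sketch of both options, assuming a hypothetical two-column frame named `toy`: `rename()` returns a new DataFrame by default, and `inplace=True` only takes effect when it is passed as a keyword argument inside the call.

```python
import pandas as pd

# Hypothetical toy frame with upper-case column labels
toy = pd.DataFrame({"DATE": ["20130101"], "TMAX": ["0"]})

# rename() returns a new DataFrame; `toy` itself is unchanged
lowered = toy.rename(columns={"DATE": "date", "TMAX": "tmax"})

# Building the mapping from the existing labels avoids typing each pair
mapping = {col: col.lower() for col in toy.columns}

# inplace=True is an argument of the call, applied here to a copy
toy_copy = toy.copy()
toy_copy.rename(columns=mapping, inplace=True)

print(list(toy.columns), list(lowered.columns), list(toy_copy.columns))
```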

In [3]:
# rename the columns so they are all lower case;
# DO NOT type the {old: new names}, write a function that takes the column names,
# use `df.columns` to get these, and returns a dictionary;
# create a .py file for the helper functions you create and use
# throughout the course; import this file into this notebook and use it to get the
# dict that can be passed into `.rename()`
# Add the `inplace=True` parameter or make a copy of the dataframe and save the change to this copy.
def rename_columns(cols):
    col_dict = {}
    for col in cols:
        col_dict[col] = col.lower()
    return col_dict
test=rename_columns(df.columns)
print(test)


{'STATION': 'station', 'DATE': 'date', 'PRCP': 'prcp', 'SNWD': 'snwd', 'SNOW': 'snow', 'TMAX': 'tmax', 'TMIN': 'tmin'}


# Remove Rows/Columns



In Pandas, the [drop( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method is used to remove rows or columns from a DataFrame.

---

To drop **rows**, you specify the index/name label(s) of the row(s) to be removed. The axis parameter should be set to 0 or 'index' (axis=0 is default so not necessary)

`df.drop(row_name, axis=0)  # single row`

`df.drop([row_name1, row_name2, ...], axis=0)  # multiple rows`

---

To drop **columns**, you specify the index/name label(s) of the column(s) to be removed. The axis parameter should be set to 1 or 'columns' (axis=0 is the default, so this is necessary)

`df.drop(col_name, axis=1)  # single col`

`df.drop([col_name1, col_name2, ...], axis=1)  # multiple cols`

---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.
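The row and column forms can be compared side by side. This is a sketch on a hypothetical frame (`toy`, with string row labels `r0` to `r2`); the `columns=` keyword shown last is an alternative to `axis=1`.

```python
import pandas as pd

# Hypothetical toy frame with explicit row labels
toy = pd.DataFrame(
    {"station": ["A", "A", "B"], "prcp": [0, 5, 2], "snow": [0, 0, 1]},
    index=["r0", "r1", "r2"],
)

no_first_row = toy.drop("r0")                 # axis=0 is the default
no_two_rows = toy.drop(["r0", "r2"], axis=0)  # several rows at once
no_station = toy.drop("station", axis=1)      # a column needs axis=1
same_result = toy.drop(columns=["station"])   # keyword form, no axis needed

print(no_two_rows)
print(no_station)
```

Each call returns a new DataFrame, so `toy` still has all three rows and columns afterwards.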



In [4]:
# remove row 0 and the 'station' and 'snwd' columns
# save this change to a copy of the dataframe
df=df.drop(index=0)
df=df.drop(columns=['STATION', 'SNWD'])
print(df.head(3))



       DATE PRCP SNOW TMAX  TMIN
1  20130101    0    0    0  -117
2  20130102    0    0  -44  -161
3  20130103    0    0  -50  -178


# Set/Reset Indices

## Set Index

In Pandas the [set_index( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method is used to set one or more existing columns (or arrays of the correct length) as the index of a DataFrame. This allows for more meaningful and efficient data selection and alignment. 

`df.set_index(keys, drop=True)`

   * `keys`: a single column label, a list of column labels, or an array-like object of the same length as the dataframe
   * `drop`: If `True` (default), the column used as the new index will be removed from the DataFrame's columns. If `False`, they will remain as regular columns in addition to being the index.

   ---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.
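A short sketch of the `drop` parameter, assuming a hypothetical frame `toy` with a `date` column:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"date": ["20130101", "20130102"], "tmax": [0, -44]})

by_date = toy.set_index("date")           # 'date' leaves the columns
kept = toy.set_index("date", drop=False)  # 'date' is both index and column

# With a meaningful index, a row can be selected by its label
value = by_date.loc["20130102", "tmax"]
print(value)
```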

## Reset Index

The [reset_index( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) method in pandas is used to reset the index of a DataFrame or Series to the default integer-based index (0, 1, 2, ...). This operation converts the current index into a regular column(s) within the DataFrame.

`df.reset_index(drop=False)`
   * `drop`: The `drop=True` argument can be used to discard the old index entirely, preventing it from becoming a new column.

---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.
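The effect of `drop` is easiest to see on a frame whose index no longer starts at 0. The sketch below assumes a hypothetical frame `toy` with index labels 1, 2, 3.

```python
import pandas as pd

# Hypothetical toy frame whose index starts at 1, as after dropping row 0
toy = pd.DataFrame({"tmax": [0, -44, -50]}, index=[1, 2, 3])

as_column = toy.reset_index()           # old index becomes a column named 'index'
discarded = toy.reset_index(drop=True)  # old index is thrown away

print(as_column)
print(discarded)
```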

In [5]:
# notice that when the row was dropped, its index label was dropped too
# reset the indices so they are ordered correctly
df=df.reset_index(drop=True)
print(df.head(3))


       DATE PRCP SNOW TMAX  TMIN
0  20130101    0    0    0  -117
1  20130102    0    0  -44  -161
2  20130103    0    0  -50  -178


# Locating/Handling Missing Values



## Locating Missing Values

In pandas, [isnull( )](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) is a function used to detect missing or null values (NaN, None, NaT) within a Series or DataFrame. It returns a boolean object of the same shape, where True indicates a missing value and False indicates a valid value.

`df.isnull()`

This will return the dataframe with `True` or `False` for each value in the dataframe. 

This may not be as helpful as knowing exactly where the `NaN` value occurs.  This can be done by using `isnull()`, on a specific column, within a `loc` call.

`df.loc[df['col_name'].isnull()]`

This returns every row in which that column holds a `NaN`.
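As a sketch of the difference between the full boolean frame and the filtered rows, assuming a hypothetical frame `toy` with two numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "prcp": [0.0, np.nan, 3.0],
    "tmin": [np.nan, -161.0, -178.0],
})

mask = toy["prcp"].isnull()              # boolean Series, True where prcp is NaN
rows_missing_prcp = toy.loc[mask]        # only the rows where prcp is NaN
missing_per_column = toy.isnull().sum()  # count of NaN values in each column

print(rows_missing_prcp)
print(missing_per_column)
```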

In [None]:
# look through the columns to find where there are any missing values
# add .index to the end of the line of code, look at the value returned
# assign each of the returned objects to a variable that can be used later
missing_prcp=df[df['PRCP'].isnull()].index
missing_snow=df[df['SNOW'].isnull()].index
missing_tmax=df[df['TMAX'].isnull()].index
missing_tmin=df[df['TMIN'].isnull()].index
print(missing_prcp)
print(missing_snow)
print(missing_tmax)
print(missing_tmin)


Index([143], dtype='int64')
Index([], dtype='int64')
Index([181], dtype='int64')
Index([3, 243], dtype='int64')


## Handling Missing Values



### Drop Missing Values

The [dropna( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method is used to remove rows or columns from a DataFrame that contain missing values (represented by NaN or None). It is a fundamental function for data cleaning and manipulation.

`df.dropna()`

The default behavior is to remove all rows that have one or more NaN values.

---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.
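A minimal sketch of the default behavior, assuming a hypothetical frame `toy`. The `subset` argument shown second is an optional parameter of `dropna()` that limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "prcp": [0.0, np.nan, 3.0],
    "tmin": [np.nan, -161.0, -178.0],
})

complete_rows = toy.dropna()                 # drops any row containing a NaN
prcp_present = toy.dropna(subset=["prcp"])   # only checks the 'prcp' column

print(complete_rows)
print(prcp_present)
```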

---

### Replace missing values

The [fillna( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method in Pandas is used to replace missing values (NaN or NA) in a Series or DataFrame with a specified value or using a particular method.

   * Filling with a scalar value: You can replace all `NaN` values with a single value, such as 0, the mean, or a specific string.

      `df.fillna(val)`

   * Filling with Different Values per Column: You can provide a dictionary to `fillna()` to specify different fill values for different columns.

      `values={col_name1:val1, col_name2:val2, ...}`

      `df.fillna(value=values)`

---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.
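Both forms of `fillna()` can be sketched on a hypothetical frame `toy`. The choice of median and mean as fill values here is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "prcp": [0.0, np.nan, 3.0],
    "tmin": [np.nan, -161.0, -178.0],
})

# One scalar for every missing value
zero_filled = toy.fillna(0)

# A different fill value for each column, given as a dictionary
values = {"prcp": toy["prcp"].median(), "tmin": toy["tmin"].mean()}
per_column = toy.fillna(value=values)

print(zero_filled)
print(per_column)
```

`median()` and `mean()` skip `NaN` by default, so the fill values are computed from the values that are present.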

---

# Duplicate Rows
Duplicate records can distort your analysis: repeated rows are counted more than once, which skews counts, averages, and other summary statistics so that they no longer accurately show trends and underlying patterns.

## Identify Duplicates

The [duplicated( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method returns a Series of True and False values in which True marks each row that repeats an earlier row.

`df.duplicated(**kwargs)`

Optional Parameters:

* `subset=column_label(s)`: A string, or a list, of the column names to include when looking for duplicates. Default `subset=None` (meaning no subset is specified, and all columns are included).

---

* add `.sum()` to the end of the line of code to count the rows flagged as duplicates
* find the duplicated row using `.loc`
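The two steps above can be sketched on a hypothetical frame `toy` in which the last row repeats the one before it:

```python
import pandas as pd

# Hypothetical toy frame; row 2 repeats row 1
toy = pd.DataFrame({
    "date": ["20130101", "20130102", "20130102"],
    "tmax": [0, -44, -44],
})

flags = toy.duplicated()    # True for a row that repeats an earlier row
n_dupes = flags.sum()       # True counts as 1, so this counts flagged rows
dupe_rows = toy.loc[flags]  # the flagged rows themselves

print(n_dupes)
print(dupe_rows.index)
```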

In [None]:
# get the index of the row that is a duplicate of the previous row(s)



## Remove Duplicates

The [drop_duplicates( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method in Pandas is used to remove duplicate rows from a DataFrame.

`df.drop_duplicates(**kwargs)`

* `keep`: 'first' to keep the first occurrence; 'last' to keep the last; and False to drop all occurrences

* `subset`: column or list of columns to base the drop on

---

Add the `inplace=True` parameter or make a `copy()` of the dataframe and save the change to this copy.

Reset the indices after removing rows
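A sketch of the three `keep` options, assuming a hypothetical frame `toy` in which rows 1 and 2 are identical:

```python
import pandas as pd

# Hypothetical toy frame; rows 1 and 2 are identical
toy = pd.DataFrame({
    "date": ["20130101", "20130102", "20130102"],
    "tmax": [0, -44, -44],
})

keep_first = toy.drop_duplicates()            # default keep='first'
keep_last = toy.drop_duplicates(keep="last")  # keeps the later copy
drop_all = toy.drop_duplicates(keep=False)    # removes every copy

# Removing rows leaves gaps in the index, so reset it afterwards
deduped = toy.drop_duplicates(keep="last").reset_index(drop=True)

print(list(keep_last.index), list(deduped.index))
```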

---

In [None]:
# drop the duplicated row and reset the indices



# Changing Datatypes

## `.astype()`

[astype( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) is the most direct and versatile method for converting a column to a specific data type.

`df.astype(dtype, **kwargs)`

* `dtype`: Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to cast entire pandas object to the same type. Alternatively, use a mapping, e.g. {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type

---

Assign the change to the same dataframe or to a new one

Use `.info()` to confirm the changes have occurred
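As a sketch of the mapping form, assuming a hypothetical frame `toy` whose columns were all read in as strings:

```python
import pandas as pd

# Hypothetical toy frame with every value stored as a string
toy = pd.DataFrame({
    "date": ["20130101", "20130102"],
    "prcp": ["0", "5"],
    "tmax": ["0", "-44"],
})

# Convert only the listed columns; 'date' is left as it is
converted = toy.astype({"prcp": float, "tmax": float})

converted.info()  # the dtype column should now show float64 for prcp and tmax
```

`astype()` returns a new DataFrame, so the result has to be assigned to a name for the change to persist.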

In [None]:
# Change the non-date columns to floats



# Assigning Values to Cells

The [at( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) accessor is used to access a single value in a DataFrame by using a row and column label. It is the fastest and most efficient method for this specific task.

`df.at[row_label, column_label]`

* Label-based access: `at` uses the row and column names (labels) to find a specific cell, not their integer position. This is a key difference from `.iat`.

* Scalar value access: `at` is optimized for getting or setting a single, non-aggregated value.

* High performance: When you only need to access or update one value, `at` is significantly faster than `.loc` because it has less overhead.

* Write capabilities: You can use `at` to both **get and set** a value.
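Getting and setting with `at` can be sketched as follows, assuming a hypothetical single-column frame `toy` with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a missing value in row 1
toy = pd.DataFrame({"tmax": [0.0, np.nan, -50.0]})

first_value = toy.at[0, "tmax"]  # get: row label 0, column label 'tmax'
toy.at[1, "tmax"] = -44.0        # set: overwrite the NaN in place

print(first_value)
print(toy)
```

Unlike most of the methods above, assignment through `at` modifies the DataFrame directly, so no reassignment is needed.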

In [None]:
# practice changing the NaN values you found earlier;
# then replace the NaN in the 'date' column with an appropriate value
# finally convert all dates in the column to datetime objects



In [10]:
# Replace the missing values in the tmax and tmin columns with the average
# temperatures for the months they exist in



In [11]:
# Replace the missing 'prcp' data with the median prcp value for the month
# in which it exists



In [12]:
# Create a copy of the dataframe and remove the row where 'snow' has a missing value



# Assigning Values to an Entire Column


## Direct Assignment

This is the most common and straightforward method. You select the column by its name and assign a single value to it. Pandas will automatically broadcast this single value to all rows in that column.

`df[col_name] = val`

---

If the column doesn't exist, it will be created and filled with the value.

## Using .loc Accessor

The `.loc` accessor allows for label-based indexing and can be used to select all rows and a specific column for assignment. This method is particularly useful when you want to explicitly indicate that you are modifying the original DataFrame.

`df.loc[:, col_name] = value`

## Using assign() method:
The [assign( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) method creates new columns or modifies existing ones and returns a new DataFrame. This method is useful when you want to avoid modifying the original DataFrame in place.

For a single column:

`df.assign(col_name = value)`

---

For several columns, unpack a dictionary into keyword arguments:

`df.assign(**dict)`

* `dict`: A dictionary of `col_name: values` pairs

---

`value` is a single value, a list of values of the same length as the dataframe, or a function to compute the values to be assigned
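The three approaches can be compared on one frame. This is a sketch with a hypothetical frame `toy` and made-up column names (`a`, `b`, `units`, `total`, `flag`).

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Direct assignment: the scalar is broadcast, and 'units' is created
toy["units"] = "mm"

# .loc: all rows of an existing column
toy.loc[:, "units"] = "cm"

# assign(): returns a new DataFrame, `toy` keeps its current columns
with_total = toy.assign(total=lambda d: d["a"] + d["b"])

# Several columns at once by unpacking a dictionary
new_cols = {"a_doubled": toy["a"] * 2, "flag": True}
with_more = toy.assign(**new_cols)

print(with_total)
print(with_more)
```

The function passed to `assign()` receives the DataFrame itself, which is why the lambda can refer to the `a` and `b` columns.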



In [13]:
# convert the values for 'prcp' and 'snow' to inches



In [14]:
# create two new columns to store the temperature conversions to Fahrenheit

