<a href="https://colab.research.google.com/github/naomilago/Data-Analysis-with-Python-3-and-Pandas/blob/master/Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style='color:#352459;'>Data Analysis with Python 3 and Pandas</h1>

<span style="color:#352459; font-weight:bold;">NOTE:</span> In this notebook I'm following the <a style='color:#7952B3;' href='https://pythonprogramming.net/introduction-python3-pandas-data-analysis/'>pythonprogramming.net</a> material.

<h2 style='color:#352459;'>Summary</h2>

- <a href='#whatIsDataAnalysis' style='color:#7952B3;'>What is Data Analysis?</a>
- <a href='#whatIsPandas' style='color:#7952B3;'>What is Pandas?</a>
- <a href='#firstSteps' style='color:#7952B3;'>First steps</a>

<h2 id='whatIsDataAnalysis'>What is Data Analysis?</h2>

⠀According to <a style='color:#7952B3;' href='https://monkeylearn.com/blog/data-analysis-examples'>Monkey Learn</a>, __Data Analysis__ is the process of cleaning, _analyzing_, interpreting, and visualizing data. Its purpose is to discover insights that drive smarter and more effective business decisions.
⠀On its website, Monkey Learn lists what it considers the best <a style='color:#7952B3;' href='https://monkeylearn.com/blog/data-analysis-tools/'>Data Analysis Tools</a>, as follows:

<ol style="list-style-type: decimal">
<li><a style='color:#7952B3;' href="https://monkeylearn.com/">MonkeyLearn</a> | Perform no-code text analysis</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#rapidminer">RapidMiner</a> | Build predictive analysis models</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#knime">KNIME</a> | Create data science workflows</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#talend">Talend</a> | Collect your data in a single platform</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#excel">Excel</a> | Use powerful data analysis formulas</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#airtable">Airtable</a> | Part spreadsheet, part database</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#power-bi">Power BI</a> | See your results in real time</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#tableau">Tableau</a> | Visualize your results in style</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#clicdata">ClicData</a> | Connect data and create interactive dashboards</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#r">R</a> | The programming language for exploratory data analysis</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#python">Python</a> | The programming language for machine learning</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#qlik">Qlik</a> | Perform in-memory data processing</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#sas">SAS Business Intelligence</a> | Easy-to-understand visualizations</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#looker">Looker</a> | Tailored analytics solutions</li>
<li><a style='color:#7952B3;' href="https://monkeylearn.com/blog/data-analysis-tools/#sql">SQL Programming Language</a> | Easily organize structured data</li>
</ol>

<h2 id='whatIsPandas'>What is Pandas?</h2>

⠀According to <a style='color:#7952B3;' href='https://www.activestate.com/resources/quick-reads/what-is-pandas-in-python-everything-you-need-to-know/'>ActiveState</a>, Pandas is an open source Python package that is most widely used for Data Science/Data Analysis and Machine Learning tasks. It is built on top of another package named <a style='color:#7952B3;' href='https://www.activestate.com/products/python/python-packages/'>NumPy</a>, which provides support for multi-dimensional arrays.
<br />
⠀As one of the most popular data wrangling packages, Pandas works well with many other data science modules inside the Python ecosystem, and is typically included in every Python distribution, from those that come with your OS to commercial vendor distributions like ActiveState's <a style='color:#7952B3;' href='https://platform.activestate.com/featured-projects/'>ActivePython</a>.
<br />
⠀Pandas simplifies many of the time-consuming, repetitive tasks involved in working with data, including:

- Data cleansing
- Data fill
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data
- And much more
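
⠀To make a few of the items above concrete, here is a minimal sketch of data fill, normalization, a merge, and summary statistics. The two small tables (`prices` and `volumes`) and their values are hypothetical toy data, not part of the avocado dataset used later, and min-max scaling is just one of several ways to normalize a column.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
prices = pd.DataFrame({
    "region": ["Albany", "Boise", "Spokane"],
    "price": [1.0, None, 3.0],  # None becomes a missing value (NaN)
})
volumes = pd.DataFrame({
    "region": ["Albany", "Boise", "Spokane"],
    "volume": [100, 200, 300],
})

# Data fill: replace the missing price with the mean of the known prices.
prices["price"] = prices["price"].fillna(prices["price"].mean())

# Data normalization: rescale prices to the 0-1 range (min-max scaling).
price_range = prices["price"].max() - prices["price"].min()
prices["price_scaled"] = (prices["price"] - prices["price"].min()) / price_range

# Merges and joins: combine both tables on the shared 'region' column.
merged = prices.merge(volumes, on="region")

# Statistical analysis: summary statistics for the numeric columns.
summary = merged.describe()

print(merged)
print(summary)
```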

<h2 id='firstSteps'>First steps</h2>

⠀Here are some basic commands for taking our first steps with Pandas. <br />
⠀The first thing we have to do is `import` the `pandas` library in our code, giving it the alias `pd` so that we can refer to it by that shorter name.

In [None]:
import pandas as pd # 'pd' is the conventional alias for this library.

⠀Now, I want to read a .csv file named `avocado.csv` that you can find <a style='color:#7952B3;' href='https://github.com/naomilago/Data-Analysis-with-Python-3-and-Pandas/blob/master/Assets/avocado.csv'>here</a>. <br />
⠀After getting the file URI, I'll assign it to a variable (`uri`) before using it with pandas. <br />
⠀Then, I'll create a DataFrame variable (`dataFrame`) that holds the result of reading the .csv file. <br />
⠀Note that we call the function `read_csv()`, which is responsible for reading the data, through `pd`, the alias of the library we just imported.

In [None]:
uri = 'https://raw.githubusercontent.com/naomilago/Data-Analysis-with-Python-3-and-Pandas/master/Assets/avocado.csv'

dataFrame = pd.read_csv(uri)
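
⠀If you want to try `read_csv()` without a network connection, the sketch below reads CSV text from memory instead of a URL. The three rows are a reduced excerpt of the first Albany rows shown further down, and the names `csv_text` and `sample` are only illustrative. The `parse_dates` argument is optional; I use it here to show that the `Date` column can be loaded as real dates rather than plain text.

```python
import io

import pandas as pd

# A small excerpt (three columns only) written inline for illustration.
csv_text = """Date,AveragePrice,region
2015-12-27,1.33,Albany
2015-12-20,1.35,Albany
2015-12-13,0.93,Albany
"""

# read_csv accepts a file path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the type pandas inferred for each column
```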

⠀This file might have a lot of rows, so before displaying it, let's look at two other methods: `head()` and `tail()`. <br />
- `head()` -> Shows the first 5 rows by default
- `tail()` -> Shows the last 5 rows by default <br />

⠀PS.: We can pass the number of rows we want as a parameter to show more or fewer of them (this works for both methods).

In [None]:
dataFrame.head(3)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany


In [None]:
dataFrame.tail(6)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
18243,6,2018-02-11,1.57,15986.17,1924.28,1368.32,0.0,12693.57,12437.35,256.22,0.0,organic,2018,WestTexNewMexico
18244,7,2018-02-04,1.63,17074.83,2046.96,1529.2,0.0,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,2018-01-28,1.71,13888.04,1191.7,3431.5,0.0,9264.84,8940.04,324.8,0.0,organic,2018,WestTexNewMexico
18246,9,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.8,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.0,0.0,organic,2018,WestTexNewMexico
18248,11,2018-01-07,1.62,17489.58,2894.77,2356.13,224.53,12014.15,11988.14,26.01,0.0,organic,2018,WestTexNewMexico
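
⠀The behaviour of `head()` and `tail()` is easier to check on a tiny table. This is a minimal sketch using a hypothetical ten-row DataFrame called `numbers`; it also shows that a negative argument means "everything except that many rows".

```python
import pandas as pd

# Hypothetical ten-row table with the values 0..9.
numbers = pd.DataFrame({"n": range(10)})

first_default = numbers.head()         # no argument: first 5 rows
last_two = numbers.tail(2)             # last 2 rows
all_but_last_three = numbers.head(-3)  # negative n: all rows except the last 3

print(first_default)
print(last_two)
print(all_but_last_three)
```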


⠀What about selecting just one column? Let's do it now.

In [None]:
dataFrame['AveragePrice'].head()

0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64
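
⠀Selecting with a single column name returns a `Series`, as in the output above. As a small sketch on a hypothetical three-row table (`df`), passing a list of names instead returns a `DataFrame` with only those columns.

```python
import pandas as pd

# Hypothetical three-row table reusing a few column names from the dataset.
df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "region": ["Albany", "Albany", "Albany"],
    "year": [2015, 2015, 2015],
})

one_column = df["AveragePrice"]               # single name -> Series
two_columns = df[["AveragePrice", "region"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```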

⠀We can also filter rows by a specific value in a column. Let's create a new DataFrame that contains only the rows for the Spokane region.

In [None]:
spokane_dataFrame = dataFrame[dataFrame['region'] == 'Spokane']
spokane_dataFrame.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
2444,0,2015-12-27,1.05,67099.38,20577.16,21592.44,2889.97,22039.81,21900.09,0.0,139.72,conventional,2015,Spokane
2445,1,2015-12-20,1.12,61555.62,17819.26,16576.75,2660.37,24499.24,24499.24,0.0,0.0,conventional,2015,Spokane
2446,2,2015-12-13,0.99,67431.18,22229.24,20738.68,2189.59,22273.67,22269.39,4.28,0.0,conventional,2015,Spokane
2447,3,2015-12-06,0.85,100233.67,18780.0,39234.39,2758.78,39460.5,38946.79,513.71,0.0,conventional,2015,Spokane
2448,4,2015-11-29,1.17,51432.09,16876.93,16826.84,2523.85,15204.47,15204.47,0.0,0.0,conventional,2015,Spokane
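
⠀The filter works because the comparison produces a boolean `Series` (a mask), and indexing with that mask keeps only the rows marked `True`. Here is a minimal sketch on a hypothetical four-row table; the price threshold and the region list are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical four-row table, made up for illustration.
df = pd.DataFrame({
    "region": ["Albany", "Spokane", "Spokane", "Boise"],
    "AveragePrice": [1.33, 1.05, 0.85, 1.10],
})

# The comparison yields one True/False value per row.
is_spokane = df["region"] == "Spokane"
spokane = df[is_spokane]

# Conditions combine with & (and) and | (or); each one needs parentheses.
cheap_spokane = df[is_spokane & (df["AveragePrice"] < 1.0)]

# isin() matches any value from a list.
selected = df[df["region"].isin(["Albany", "Boise"])]

print(spokane)
print(cheap_spokane)
print(selected)
```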


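⠀There's an interesting thing to note called `index`. Indexes are labels that identify each row, and you can see them in the leftmost column of the output above (2444, 2445, ...) or in the list below. Note that filtering keeps the labels the rows had in the original `dataFrame`.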

In [None]:
spokane_dataFrame.index

Int64Index([ 2444,  2445,  2446,  2447,  2448,  2449,  2450,  2451,  2452,
             2453,
            ...
            18167, 18168, 18169, 18170, 18171, 18172, 18173, 18174, 18175,
            18176],
           dtype='int64', length=338)
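
⠀Because filtering keeps the original labels, selecting by label and selecting by position can give different rows. The sketch below illustrates this on a hypothetical four-row table, and shows `reset_index(drop=True)` as one way to renumber the rows from zero if that is what you prefer.

```python
import pandas as pd

# Hypothetical four-row table, made up for illustration.
df = pd.DataFrame({
    "region": ["Albany", "Spokane", "Boise", "Spokane"],
    "AveragePrice": [1.33, 1.05, 1.10, 0.85],
})

spokane = df[df["region"] == "Spokane"]
print(spokane.index.tolist())  # the labels from df are kept

# .loc selects by label, .iloc selects by position.
by_label = spokane.loc[3, "AveragePrice"]       # row whose label is 3
by_position = spokane.iloc[0]["AveragePrice"]   # first row of the filtered table

# Discard the old labels and renumber from 0.
renumbered = spokane.reset_index(drop=True)
print(renumbered.index.tolist())
```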

⠀Now, if we want to see all the unique values in a column, guess what we need to use? Exactly: the `unique()` method.

In [None]:
dataFrame['region'].unique()

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)
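
⠀Two related methods are often useful next to `unique()`: `nunique()` counts the distinct values and `value_counts()` counts how often each one appears. This is a minimal sketch on a hypothetical five-item `Series` named `regions`.

```python
import pandas as pd

# Hypothetical column with repeated values.
regions = pd.Series(
    ["Albany", "Spokane", "Albany", "Boise", "Albany"], name="region"
)

distinct = regions.unique()      # NumPy array, in order of first appearance
how_many = regions.nunique()     # number of distinct values
counts = regions.value_counts()  # occurrences per value, most frequent first

print(distinct)
print(how_many)
print(counts)
```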

In [None]:
dataFrame