# Pandas Intro
The first step is to complete [this great tutorial](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/), which will introduce you to the wonderful data manipulation package called [Pandas](https://pandas.pydata.org/).

A few comments/modifications to the tutorial:
*   You do **not** need to install pandas if you use Google Colab. Just `import pandas as pd` and you're good to go.
*   Using Google Colab can make file input/output somewhat annoying, so feel free to skip the *How to read in data* section.
*   To import the `IMDB-Movie-Data.csv` data set, use the following code:
```
url = "https://raw.githubusercontent.com/LearnDataSci/articles/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv"
df = pd.read_csv(url)
```




# Homemade Expansion on Pandas Intro
Below are some homemade tutorials that expand on the skills you learned in the **Learn Data Science** tutorial. Try these after you've completed the tutorial above.

## Which movie genres tend to have the highest/lowest success on IMDB?
We will answer this question using 3 different measures of success: **metascore**, **rating**, and **revenue**. We provide some guidelines on how to answer this question below.

In [None]:
###################
### Walkthrough ###
###################

### First, we calculate the mean and standard deviation of the
### metascore/rating/revenue of the movies tagged with each genre

# Import the following packages: 
    # numpy (as np), pandas (as pd), matplotlib.pyplot (as plt)

# Make sure you've imported the IMDB-Movie-Data.csv file and created a dataframe with it
    # Use the same code from above where we import the data directly from github

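# Obtain a list of all *unique* genres which are present in the data set
    # Extract the genres column (each entry is one comma-separated string,
    # e.g. "Action,Adventure,Sci-Fi", so the column is not yet a list of lists)
        # Hint: df['column_name']
    # Split each string on the commas to get a list of lists
        # Hint: use .str.split(',') on the column
    # Flatten this data into a list of strings (those strings being genres)
        # Hint: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html
        #       only works on rectangular arrays; a nested list comprehension
        #       also handles rows with different numbers of genres
    # Cast the list as a set to get rid of the duplicates
    # Cast the set back into a list so we can iterate on it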

# For each genre
    # Extract all entries (using conditional selection) for movies in that genre
    # Calculate the mean and stdev of metascore, rating, and revenue for that genre
        # Hint: use np.mean (https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
        #       and np.std (https://numpy.org/doc/stable/reference/generated/numpy.std.html)
    # Hint: See example 6 of https://www.programiz.com/python-programming/list-comprehension
    # Hint: Use an f-string to name the column in df https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python

# Collect this data in another dataframe, as shown below
#   genre  | mean_metascore | mean_rating | mean_revenue | std_metascore |...
#   -------|----------------|-------------|--------------|---------------|...
#   Horror | 55.2           | 6.8         | 227.7        | 4.5           |...
#   ...


# Now for each measure of success (mean_metascore, mean_rating, mean_revenue),
    # Sort the dataset by that column
    # Create a scatter plot of genre (x) vs measure of success (y),
    # with the standard deviation plotted as error bars
        # Hint: use plt.errorbar

# What kinds of surprising discoveries or interesting conclusions can you glean from these plots?

# Now we want to know if these three measures of success correlate with each
# other. To figure that out, we're going to make scatterplots comparing the
# revenue, ratings, and metascore of every movie in the dataset,
# e.g., ratings vs revenue --> Do highly rated movies make more money?
    # Hint: Use plt.scatter (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html)
    # Hint: Use the plt.scatter alpha argument (alpha=0.2) to make the points
    #       transparent so you can see the distribution more clearly
    # Fit a linear trend line with np.polyfit (https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html)
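
The genre steps of the walkthrough can be sketched on a tiny made-up table before trying them on the real data. Everything in this example is an assumption for illustration: the `toy` dataframe, its values, and the column names `genre`, `rating`, and `metascore` are invented and are not taken from the IMDB file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the IMDB dataframe (made-up values, hypothetical column names)
toy = pd.DataFrame({
    "genre": ["Action,Sci-Fi", "Horror", "Action,Comedy", "Comedy"],
    "rating": [8.0, 6.0, 7.0, 5.0],
    "metascore": [70.0, 50.0, 60.0, 40.0],
})

# Split each comma-separated string, then flatten and de-duplicate with a set
genre_lists = toy["genre"].str.split(",")
unique_genres = sorted({g for genres in genre_lists for g in genres})

measures = ["rating", "metascore"]
rows = []
for genre in unique_genres:
    # Conditional selection: keep the rows whose genre list contains this genre
    in_genre = toy[genre_lists.apply(lambda genres: genre in genres)]
    row = {"genre": genre}
    for m in measures:
        values = in_genre[m].to_numpy()
        # f-strings build the column names mean_<measure> and std_<measure>
        row[f"mean_{m}"] = np.mean(values)
        row[f"std_{m}"] = np.std(values)
    rows.append(row)

# One row per genre, one column per statistic
genre_stats = pd.DataFrame(rows)
print(genre_stats)
```

If a column in the real data has missing values, `np.mean` and `np.std` return `nan` for that genre; `np.nanmean` and `np.nanstd` are one possible way around that.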

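For the error-bar plot described in the walkthrough, here is a minimal sketch of how `plt.errorbar` can be combined with a sorted summary table. The `genre_stats` values below are made up, and only one measure (`mean_rating` with `std_rating`) is shown; the other measures would follow the same pattern.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up summary table in the shape described in the walkthrough
genre_stats = pd.DataFrame({
    "genre": ["Horror", "Action", "Comedy"],
    "mean_rating": [6.1, 6.9, 6.5],
    "std_rating": [0.9, 0.7, 1.1],
})

# Sort by the measure so the genres run from lowest to highest mean
ordered = genre_stats.sort_values("mean_rating")

# fmt="o" draws markers only (no connecting line); yerr sets the bar lengths
plt.errorbar(
    ordered["genre"].tolist(),
    ordered["mean_rating"].to_numpy(),
    yerr=ordered["std_rating"].to_numpy(),
    fmt="o",
    capsize=3,
)
plt.xticks(rotation=45)
plt.xlabel("genre")
plt.ylabel("mean rating")
```
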
In [None]:
### Code away!
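
For the correlation step at the end of the walkthrough, the sketch below shows one way to overlay a linear trend line from `np.polyfit` on a `plt.scatter` plot. The `rating` and `revenue` arrays are made-up numbers chosen to lie exactly on a straight line, so the fitted slope and intercept are easy to check by eye; they are not values from the IMDB data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data lying exactly on the line revenue = 20 * rating - 60
rating = np.array([5.0, 6.0, 7.0, 8.0])
revenue = 20.0 * rating - 60.0

# A degree-1 fit returns the coefficients highest power first: slope, intercept
slope, intercept = np.polyfit(rating, revenue, deg=1)

# Transparent points make dense regions easier to see on real data
plt.scatter(rating, revenue, alpha=0.2)

# Draw the fitted line across the observed range of x
xs = np.linspace(rating.min(), rating.max(), 50)
plt.plot(xs, slope * xs + intercept)
plt.xlabel("rating")
plt.ylabel("revenue")
```

`np.polyfit` does not skip missing values, so on the real dataframe it may be necessary to drop rows with missing entries in the two columns first, for example with `df.dropna(subset=[...])`.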

# More Advanced Tutorials
Free
* [EdX - Python for Data Science](https://www.edx.org/course/python-for-data-science-2)
* [EdX - Probability and Statistics in Data Science](https://www.edx.org/course/probability-and-statistics-in-data-science-using-p)
* https://towardsdatascience.com/top-9-data-science-projects-for-a-beginner-in-2020-26eb7d42b116
* https://data-flair.training/blogs/python-deep-learning-project-handwritten-digit-recognition/
* https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_ecosystem.htm
* https://analyticsindiamag.com/popular-data-science-projects-for-aspiring-data-scientists/


Paid
* [DataCamp - Intro to Data Science in Python](https://www.datacamp.com/courses/introduction-to-data-science-in-python)
* [DataCamp - Supervised Learning with scikit-learn](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn)