<img src = "https://github.com/barcelonagse-datascience/academic_files/raw/master/bgsedsc_0.jpg">

# Introduction to programming with Python


<img src = "https://www.python.org/static/img/python-logo.png">

# Programming project 1: basic text analytics

Open the file "textfile.txt", which contains a passage copied from the book "Learning Python".

The task is to read the text into Python and do some basic text analysis. In particular, write Python code that:

+ counts how many sentences there are in the text
+ counts how many words there are in the text
+ finds all the distinct words in the text, computes how often each one appears, and stores the result in a convenient structure so that the frequency of any given word is easy to look up later

As a first step, and as a simpler exercise, do the above for a single paragraph of the text. If you do not manage to finish the larger project, submit this as your solution.
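
The single-paragraph warm-up could look roughly like the sketch below. The paragraph here is a made-up sample, not taken from "textfile.txt", and the names `sample_paragraph`, `sample_sentences`, `sample_words` and `sample_frequencies` are illustrative only.

```python
import re

# Hypothetical one-paragraph sample; in the project this would be
# a single paragraph taken from textfile.txt.
sample_paragraph = ("Python is easy to read. Is it also easy to write? "
                    "Many people think so! Python is popular.")

# Split on sentence-ending punctuation and drop empty pieces.
sample_sentences = [s.strip() for s in re.split(r"[.?!]+", sample_paragraph) if s.strip()]

# Lowercase the text, then treat every run of letters or digits as a word.
sample_words = re.findall(r"[a-z0-9]+", sample_paragraph.lower())

# Dictionary mapping each distinct word to the number of times it appears.
sample_frequencies = {}
for word in sample_words:
    sample_frequencies[word] = sample_frequencies.get(word, 0) + 1

print(len(sample_sentences), len(sample_words), sample_frequencies["python"])
```

A dictionary is one convenient way to store the result, because looking up a word such as `sample_frequencies["python"]` is a single step.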



In [None]:
# First we load the data
import urllib.request

url = "https://raw.githubusercontent.com/barcelonagse-datascience/academic_files/master/data/textfile.txt"
file = urllib.request.urlopen(url)

txt=file.read().decode()
txt

'Why Do People Use Python?\n\nBecause there are many programming languages available today, this is the usual first question of newcomers. Given that there are roughly 1 million Python users out there at the moment, there really is no way to answer this question with complete accuracy; the choice of development tools is sometimes based on unique constraints or personal preference.\n\nBut after teaching Python to roughly 225 groups and over 3,000 students during the last 12 years, some common themes have emerged. The primary factors cited by Python users seem to be these:\n\nSoftware quality\nFor many, Python’s focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. Python code is designed to be readable, and hence reusable and maintainable—much more so than traditional scripting languages. The uniformity of Python code makes it easy to understand, even if you did not write it. In addition, Python has deep support for more 

In [None]:
# Alternative: load the data from a local file
# with open('../../Data/textfile.txt') as f:
#     txt = f.read()

# txt

In [None]:
# Task 1: Count sentences
from re import sub

# Collapse all whitespace (including \n characters) into single spaces.
txt = ' '.join(txt.split())

# Replace every ? and ! with a full stop, so '.' is the only sentence terminator.
# NOTE: Regular expressions are used here, but you could
# use "replace" instead, and this particular text
# contains only question marks anyway.
# Inside [...] a '|' is a literal pipe, not "or", so it is left out.
txt = sub('[?!]', '.', txt)

# Make everything lowercase
txt = txt.lower()

txt

'why do people use python. because there are many programming languages available today, this is the usual first question of newcomers. given that there are roughly 1 million python users out there at the moment, there really is no way to answer this question with complete accuracy; the choice of development tools is sometimes based on unique constraints or personal preference. but after teaching python to roughly 225 groups and over 3,000 students during the last 12 years, some common themes have emerged. the primary factors cited by python users seem to be these: software quality for many, python’s focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. python code is designed to be readable, and hence reusable and maintainable—much more so than traditional scripting languages. the uniformity of python code makes it easy to understand, even if you did not write it. in addition, python has deep support for more advanced s

In [None]:
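# Get number of sentences by splitting on full stops.
# NOTE: if the text ends with a full stop, split also returns a final
# empty string, and that empty piece is counted here.

len(txt.split('.'))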

30
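
Splitting on `'.'` counts every piece, including an empty one after a final full stop. A possible refinement, shown on a made-up sample string rather than the real text, is to drop the empty pieces before counting. The names `sample_txt`, `naive_count` and `sentence_list` are assumptions for illustration.

```python
# Hypothetical cleaned text in the same form as `txt` above
# (lowercase, full stops as the only sentence terminator).
sample_txt = "why do people use python. because it is readable. it is also free."

# Plain split: the empty string after the last full stop is counted too.
naive_count = len(sample_txt.split('.'))

# Keep only the pieces that contain something other than whitespace.
sentence_list = [s.strip() for s in sample_txt.split('.') if s.strip()]

print(naive_count, len(sentence_list))
```

Neither version handles full stops inside a sentence, such as decimals or abbreviations; treating those correctly needs a more careful rule.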

In [None]:
# Task 2: Count words

# Remove all punctuation (semicolons, commas, hyphens, etc.) to get plain words.
# Every character that is not a lowercase letter or a digit is replaced with a space,
# which splits hyphenated words in two and turns a suffix such as 's into a separate "word".
# NOTE: Regular expressions are used again because they are a very powerful way to process
# text. If you do not know them, you could call replace several times,
# but that would not guarantee the same coverage.

words = sub('[^a-z0-9]', ' ', txt).split()
print(words[0:10])
len(words)

['why', 'do', 'people', 'use', 'python', 'because', 'there', 'are', 'many', 'programming']


565

In [None]:
# Task 3: Count how often each distinct word appears

# Dictionary mapping word -> number of appearances
counts = {}

for w in words:
    if w in counts:
        counts[w] += 1
    else:
        counts[w] = 1

print(counts)



{'why': 1, 'do': 1, 'people': 1, 'use': 2, 'python': 23, 'because': 2, 'there': 5, 'are': 4, 'many': 3, 'programming': 6, 'languages': 3, 'available': 1, 'today': 2, 'this': 4, 'is': 9, 'the': 16, 'usual': 1, 'first': 2, 'question': 2, 'of': 14, 'newcomers': 1, 'given': 1, 'that': 2, 'roughly': 2, '1': 1, 'million': 1, 'users': 3, 'out': 1, 'at': 1, 'moment': 1, 'really': 1, 'no': 1, 'way': 1, 'to': 13, 'answer': 1, 'with': 7, 'complete': 1, 'accuracy': 1, 'choice': 1, 'development': 2, 'tools': 4, 'sometimes': 1, 'based': 2, 'on': 4, 'unique': 1, 'constraints': 1, 'or': 3, 'personal': 1, 'preference': 1, 'but': 1, 'after': 2, 'teaching': 1, '225': 1, 'groups': 1, 'and': 22, 'over': 4, '3': 1, '000': 1, 'students': 1, 'during': 1, 'last': 1, '12': 1, 'years': 1, 'some': 2, 'common': 1, 'themes': 1, 'have': 1, 'emerged': 1, 'primary': 1, 'factors': 2, 'cited': 1, 'by': 2, 'seem': 1, 'be': 7, 'these': 2, 'software': 4, 'quality': 3, 'for': 6, 's': 4, 'focus': 1, 'readability': 1, 'cohere
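
As an alternative to the manual loop, the standard library's `collections.Counter` builds the same kind of word-to-frequency mapping. The sketch below uses a short made-up word list (`sample_words`) rather than the real `words` list.

```python
from collections import Counter

# Hypothetical word list standing in for the `words` list from Task 2.
sample_words = ['python', 'is', 'easy', 'python', 'is', 'python']

# Counter is a dict subclass that counts hashable items.
word_counts = Counter(sample_words)

print(word_counts['python'])       # frequency of a given word
print(word_counts['java'])         # words that never appear give 0 instead of a KeyError
print(word_counts.most_common(2))  # the two most frequent words with their counts
```

Because a `Counter` returns 0 for missing keys, looking up how often a given word appears never raises an error.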

In [None]:
# Explore through list of (word, count) pairs
output1 = list(counts.items())
output1[0:2]

[('why', 1), ('do', 1)]

In [None]:
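# Explore through numpy array
import numpy as np

# Use a new name for the distinct words so that the `words` list
# from Task 2 is not overwritten.
values = np.array(list(counts.values()))
unique_words = np.array(list(counts.keys()))
print(unique_words[0:6])
print(values[0:6])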

['why' 'do' 'people' 'use' 'python' 'because']
[ 1  1  1  2 23  2]


In [None]:
# Explore through dataframe
import pandas as pd
table = pd.DataFrame.from_dict(counts, orient='index')
print(table.head())

         0
why      1
do       1
people   1
use      2
python  23
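
The table above has an unnamed column `0`. One way to get a more readable table, sketched here with a handful of counts copied from the output above, is to name the column and sort by frequency. The names `sample_counts`, `freq_table`, `word` and `count` are choices made for this example.

```python
import pandas as pd

# A few entries copied from the `counts` output above.
sample_counts = {'why': 1, 'use': 2, 'python': 23, 'the': 16}

# Build a Series named 'count', label the index 'word',
# sort from most to least frequent, and turn the index into a column.
freq_table = (pd.Series(sample_counts, name='count')
                .rename_axis('word')
                .sort_values(ascending=False)
                .reset_index())

print(freq_table)

# Look up the frequency of a given word.
use_count = freq_table.loc[freq_table['word'] == 'use', 'count'].iloc[0]
print(use_count)
```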


# Programming project 2: Customer of the month


- We have a list of prices for certain products, given in the file "supermarket_prices.csv".
- We have a list of transactions made by certain customers over a period of one month, given in "supermarket_transactions.csv".

Calculate:
- How many items each client has purchased.
- How many items of each type each client has purchased.
- The total amount spent by each client.
- The company that supplies the supermarket with bananas wishes to give a prize to the client who has spent the largest proportion of their total spending on bananas. Who should win the prize?
- A marketing company that works with the supermarket wants to better understand the characteristics of the three people who have spent the largest share of their spending on bananas. For each of them, report the other product on which they spent the most of their remaining budget.

*Eyeballing the data is fine for checking that your code makes sense, but it will not earn full credit for the project: we want fully automated code. To carry out the project successfully you need to use most of the attributes and methods described earlier. The last question is a little tricky.*

In [None]:
import pandas as pd

prices = pd.read_csv('https://github.com/barcelonagse-datascience/academic_files/raw/master/data/supermarket_prices.csv')
transactions = pd.read_csv('https://github.com/barcelonagse-datascience/academic_files/raw/master/data/supermarket_transactions.csv')

# if loading locally
# prices = pd.read_csv('../../Data/supermarket_prices.csv')
# transactions = pd.read_csv('../../Data/supermarket_transactions.csv')

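def assign_share(df):
    # For one buyer: amount spent on each product and
    # its share of that buyer's total spending.
    df['spent'] = df.Quantity*df.Price
    df['share'] = df.spent / df.spent.sum()
    return df


df = (transactions
          .groupby(['Buyer', 'Product'])
          .sum() # Total Quantity per buyer and product (the only column left)
          .reset_index()
          .merge(prices, how='left', on='Product') # Attach the unit price of each product
          .groupby('Buyer', as_index=False)
          .apply(assign_share) # Add 'spent' and 'share' within each buyer
          .reset_index(drop=True))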

df

     Buyer Product  Quantity  Price  spent     share
0     Emma   apple        25    1.2   30.0  0.121753
1     Emma  banana        26    5.2  135.2  0.548701
2     Emma  potato        14    3.4   47.6  0.193182
3     Emma  tomato        16    2.1   33.6  0.136364
4  Jackson   apple        18    1.2   21.6  0.106509
5  Jackson  orange        28    4.3  120.4  0.593688
6  Jackson  potato         8    3.4   27.2  0.134122
7  Jackson  tomato        16    2.1   33.6  0.165680
8     John   apple         7    1.2    8.4  0.018209
9     John  banana        28    5.2  145.6  0.315630
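
The per-buyer share can also be computed without a custom function, using `groupby(...).transform('sum')` to broadcast each buyer's total back onto their rows. This is only a sketch of an alternative, not the approach used in the cell above; the `sample` frame holds the Emma and Jackson rows copied from the output.

```python
import pandas as pd

# Rows for Emma and Jackson, copied from the output above.
sample = pd.DataFrame({
    'Buyer': ['Emma', 'Emma', 'Emma', 'Emma',
              'Jackson', 'Jackson', 'Jackson', 'Jackson'],
    'Product': ['apple', 'banana', 'potato', 'tomato',
                'apple', 'orange', 'potato', 'tomato'],
    'Quantity': [25, 26, 14, 16, 18, 28, 8, 16],
    'Price': [1.2, 5.2, 3.4, 2.1, 1.2, 4.3, 3.4, 2.1],
})

# Amount spent on each product.
sample['spent'] = sample['Quantity'] * sample['Price']

# transform('sum') returns each buyer's total, aligned row by row,
# so the division gives the share of that buyer's spending.
buyer_total = sample.groupby('Buyer')['spent'].transform('sum')
sample['share'] = sample['spent'] / buyer_total

print(sample)
```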


In [None]:
#Question 1: How many items each client has purchased
df.groupby(['Buyer']).Quantity.sum()

Buyer
Emma        81
Jackson     70
John       122
Liam        81
Lucas       62
Sandra      78
Sophia      61
Tom         49
Name: Quantity, dtype: int64

In [None]:
# Question 2: How many items of each type each client has purchased
df.groupby(['Buyer', 'Product']).Quantity.sum()

Buyer    Product
Emma     apple      25
         banana     26
         potato     14
         tomato     16
Jackson  apple      18
         orange     28
         potato      8
         tomato     16
John     apple       7
         banana     28
         orange     46
         potato     18
         tomato     23
Liam     apple      21
         banana     16
         orange     16
         potato     21
         tomato      7
Lucas    apple      14
         banana      3
         orange     17
         potato      9
         tomato     19
Sandra   banana      2
         orange     37
         potato     38
         tomato      1
Sophia   apple      14
         banana     13
         orange      7
         potato     14
         tomato     13
Tom      apple      18
         banana      6
         potato     16
         tomato      9
Name: Quantity, dtype: int64

In [None]:
# Question 3: Calculate the total amount spent by each client
df.groupby(['Buyer']).spent.sum().sort_values()

Buyer
Tom        126.1
Lucas      176.0
Sophia     189.4
Jackson    202.8
Emma       246.4
Liam       263.3
Sandra     300.8
John       461.3
Name: spent, dtype: float64

In [None]:
# Question 4: Proportion of spending on bananas
# Take each buyer's top product by share, then keep the buyers whose top product is bananas.
top_products = (df
                 .sort_values('share', ascending=False)
                 .groupby('Buyer')
                 .head(1))
print(top_products)
banana_buyers = top_products[top_products.Product == 'banana'].Buyer
banana_buyers

      Buyer Product  Quantity  Price  spent     share
5   Jackson  orange        28    4.3  120.4  0.593688
1      Emma  banana        26    5.2  135.2  0.548701
24   Sandra  orange        37    4.3  159.1  0.528923
34      Tom  potato        16    3.4   54.4  0.431404
10     John  orange        46    4.3  197.8  0.428788
20    Lucas  orange        17    4.3   73.1  0.415341
28   Sophia  banana        13    5.2   67.6  0.356917
14     Liam  banana        16    5.2   83.2  0.315989


1       Emma
28    Sophia
14      Liam
Name: Buyer, dtype: object
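
The cell above selects buyers whose single largest product is bananas. A more direct reading of the question ranks all buyers by their banana share. The sketch below assumes that reading and uses only share values that appear in the outputs above; `sample`, `banana_ranking`, `winner` and `top_three` are names chosen for this example.

```python
import pandas as pd

# Share values copied from the outputs above; one non-banana row is
# included to show that it is filtered out.
sample = pd.DataFrame({
    'Buyer': ['Jackson', 'Emma', 'John', 'Sophia', 'Liam'],
    'Product': ['orange', 'banana', 'banana', 'banana', 'banana'],
    'share': [0.593688, 0.548701, 0.315630, 0.356917, 0.315989],
})

# Keep banana rows only and rank buyers by their banana share.
banana_ranking = (sample[sample['Product'] == 'banana']
                  .sort_values('share', ascending=False))

winner = banana_ranking.iloc[0]['Buyer']             # prize winner
top_three = banana_ranking.head(3)['Buyer'].tolist() # input for Question 5

print(winner)
print(top_three)
```

On these rows the ranking gives the same three buyers as the cell above, but the two approaches could differ on other data, for example if a fourth buyer's banana share exceeded that of a buyer whose top product happens to be bananas.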

In [None]:
# Question 5: For the banana buyers found above, the non-banana product with the largest share of their spending
(df
 [(df.Buyer.isin(banana_buyers) & (df.Product != 'banana'))]
 .sort_values('share', ascending=False)
 .groupby('Buyer')
 .head(1))

#df[df.Buyer=="Emma"].sort_values('share')

     Buyer Product  Quantity  Price  spent     share
16    Liam  potato        21    3.4   71.4  0.271174
30  Sophia  potato        14    3.4   47.6  0.251320
2     Emma  potato        14    3.4   47.6  0.193182
